Abstract. Since Stein's 1956 seminal paper, shrinkage has played a
fundamental role in both parametric and nonparametric inference. This
article discusses minimaxity and adaptive minimaxity in nonparamet- 
ric function estimation. Three interrelated problems, function estima- 
tion under global integrated squared error, estimation under pointwise 
squared error, and nonparametric confidence intervals, are considered. 
Shrinkage is pivotal in the development of both the minimax theory 
and the adaptation theory. 

While the three problems are closely connected and the minimax 
theories bear some similarities, the adaptation theories are strikingly 
different. For example, in a sharp contrast to adaptive point estimation, 
in many common settings there do not exist nonparametric confidence 
intervals that adapt to the unknown smoothness of the underlying func- 
tion. A concise account of these theories is given. The connections as 
well as differences among these problems are discussed and illustrated 
through examples. 

Key words and phrases: Adaptation, adaptive estimation, Bayes min- 
imax, Besov ball, block thresholding, confidence interval, ellipsoid, in- 
formation pooling, linear functional, linear minimaxity, minimax, non- 
parametric regression, oracle, separable rules, sequence model, shrink- 
age, thresholding, wavelet, white noise model. 
1. INTRODUCTION

The multivariate normal mean model 

$$ y_i = \theta_i + \sigma z_i, \qquad z_i \overset{\text{i.i.d.}}{\sim} N(0,1), \quad i = 1, \ldots, m, \tag{1} $$

occupies a central position in parametric inference. 
In his seminal paper, Stein (1956) showed that, when 
the dimension $m \ge 3$, the usual maximum likelihood
estimator $\hat\theta = (y_i)$ of the normal mean is inadmissible
under mean squared error

$$ R(\hat\theta, \theta) = \frac{1}{m} \sum_{i=1}^{m} E(\hat\theta_i - \theta_i)^2, \tag{2} $$

and demonstrated that significant gain can be achie- 
ved by using shrinkage estimators. Since then shrink- 
age has become an indispensable technique in sta- 
tistical inference, both in parametric and nonpara- 
metric settings. 
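
The size of the gain from shrinkage in model (1) can be illustrated numerically. The following is a minimal sketch and not part of the original development: it assumes the positive-part James-Stein rule as the shrinkage estimator, and the function names, the dimension m = 10, the true mean at the origin and the random seed are illustrative choices only.

```python
import numpy as np

def james_stein(y, sigma=1.0):
    """Positive-part James-Stein estimate of a normal mean vector (needs m >= 3)."""
    y = np.asarray(y, dtype=float)
    m = y.size
    # Shrink toward the origin by a data-driven factor in [0, 1].
    factor = max(0.0, 1.0 - (m - 2) * sigma**2 / np.sum(y**2))
    return factor * y

def average_loss(estimator, theta, sigma=1.0, reps=2000, seed=0):
    """Monte Carlo estimate of (1/m) * sum_i E(theta_hat_i - theta_i)^2."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    losses = []
    for _ in range(reps):
        y = theta + sigma * rng.standard_normal(theta.size)
        losses.append(np.mean((estimator(y) - theta) ** 2))
    return float(np.mean(losses))

theta = np.zeros(10)                          # hypothetical true mean, m = 10
risk_mle = average_loss(lambda y: y, theta)   # maximum likelihood estimator y
risk_js = average_loss(james_stein, theta)    # shrinkage estimator
```

In this configuration the simulated risk of the shrinkage rule is well below that of the maximum likelihood estimator; the gain is smaller when the true mean is far from the origin.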

This article considers minimaxity and adaptive minimaxity
in nonparametric function estimation. Specifically,
we discuss three interrelated problems: function
estimation under global integrated squared error,
estimation under pointwise squared error, and
nonparametric confidence intervals. The goal is to
give a concise account of important results in both the
minimax theory and adaptation theory for each prob- 
lem. The connections as well as differences among 
these problems will be discussed and illustrated
through examples. Shrinkage methods, including lin- 
ear shrinkage, separable rules, thresholding and bloc- 
kwise James-Stein procedures, figure prominently in 
the discussion. 

A primary focus in nonparametric function esti- 
mation is the construction of adaptive procedures. 
The goal of adaptive inference is to construct a sin- 
gle procedure that achieves optimality simultane- 
ously over a collection of parameter spaces. Infor- 
mally an adaptive procedure automatically adjusts 
to the smoothness properties of the underlying func- 
tion. A common way to evaluate such a procedure 
is to compare its maximum risk over each param- 
eter space in the collection with the corresponding 
minimax risk. 

As a step toward the goal of adaptive inference, 
one should first focus attention on the more concrete 
goal of developing a minimax theory over a given 
parameter space. This theory is now well developed 
particularly in the white noise with drift model: 

$$ dY(t) = f(t)\,dt + n^{-1/2}\,dW(t), \qquad 0 \le t \le 1, \tag{3} $$

where W(t) is a standard Brownian motion. This 
canonical white noise model is asymptotically equiv- 
alent to the conventional nonparametric regression 
where one observes $(x_k, y_k)$ with

$$ y_k = f(x_k) + z_k, \qquad z_k \overset{\text{i.i.d.}}{\sim} N(0,1), \quad k = 1, \ldots, n, $$

where $x_k = k/n$ in the fixed design case and the $x_k$ are i.i.d.
$\text{Uniform}(0, 1)$ in the case of random design. The
parameter n in the white noise model (3) corre- 
sponds to the sample size in the regression model. 
See Brown and Low (1996a) and Brown et al. (2002). 
There is also a slightly less direct equivalence to density
estimation and spectrum estimation. See Nussbaum
(1996), Klemelä and Nussbaum (1999) and
Brown et al. (2004). 

Let $\{\beta_i(t), i \in \mathcal{I}\}$ be an orthonormal basis
of $L^2[0,1]$ and let $y_i = \int \beta_i(t)\,dY(t)$ and
$\theta_i = \int f(t)\beta_i(t)\,dt$. Then the white noise model (3) is
equivalent to the following infinite-dimensional Gaussian
sequence model

$$ y_i = \theta_i + n^{-1/2} z_i, \qquad z_i \overset{\text{i.i.d.}}{\sim} N(0,1), \quad i \in \mathcal{I}. \tag{4} $$



An estimator $\hat\theta$ of the mean sequence $\theta$ directly provides
an estimator $\hat f(t) = \sum_{i \in \mathcal{I}} \hat\theta_i \beta_i(t)$ of the function
$f$ in the white noise model and vice versa.
Hence, the function estimation model is closely related
to the classical multivariate normal mean model (1).
In these infinite-dimensional problems it is
necessary to restrict the parameter set to be a compact
subset of $\ell^2$, the space of square summable sequences
(or a compact subset of $L^2$, the space of
square integrable functions, in the case of the white
noise model). In contrast, the parameter set in the
finite dimensional problem is typically all of $\mathbb{R}^m$.
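
The passage between the function and the sequence formulations can be made concrete with a small numerical sketch. The code below is illustrative only and rests on several assumptions not made in the text: a cosine basis, a midpoint grid used for the integrals, a hypothetical test function f, and a crude projection estimator that keeps the first 16 noisy coefficients. It checks that the integrated squared error of the reconstructed function equals the squared error of the coefficient sequence.

```python
import numpy as np

def cosine_basis(num_coef, grid):
    """Rows are beta_0(t) = 1 and beta_i(t) = sqrt(2) cos(pi i t) on the grid."""
    i = np.arange(num_coef)[:, None]
    basis = np.sqrt(2.0) * np.cos(np.pi * i * grid[None, :])
    basis[0, :] = 1.0
    return basis

n = 500                                    # noise level n^{-1/2}, as in model (3)
num_grid, num_coef = 2048, 64              # illustrative discretisation choices
grid = (np.arange(num_grid) + 0.5) / num_grid   # midpoints of [0, 1]
basis = cosine_basis(num_coef, grid)

f = np.exp(-3.0 * grid) + 0.5 * np.cos(2 * np.pi * grid)   # hypothetical f
theta = basis @ f / num_grid               # theta_i = int f(t) beta_i(t) dt (midpoint rule)

rng = np.random.default_rng(1)
y = theta + rng.standard_normal(num_coef) / np.sqrt(n)     # sequence model (4)

theta_hat = y * (np.arange(num_coef) < 16)   # keep the first 16 noisy coefficients
f_hat = theta_hat @ basis                    # f_hat(t) = sum_i theta_hat_i beta_i(t)
f_proj = theta @ basis                       # expansion of f in the retained basis

ise = np.mean((f_hat - f_proj) ** 2)         # integrated squared error on the grid
sq_err = np.sum((theta_hat - theta) ** 2)    # squared error in sequence space
```

The two error measures agree up to rounding because the sampled cosine functions are exactly orthonormal on the midpoint grid.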

Two of the most common ways of evaluating the 
performance of nonparametric function estimators 
are integrated squared error and pointwise squared 
error. Integrated squared error is used as a global 
measure of accuracy whereas pointwise squared er- 
ror gives a local measure of loss. Minimax theory 
for both of these cases has been developed. We shall 
begin our discussion on minimax theory for esti- 
mation under integrated squared error. What fol- 
lows will be elaborated in Section 2. Pinsker (1980) 
made a major breakthrough in nonparametric func- 
tion estimation theory by giving a complete and 
explicit solution to the problem of minimax esti- 
mation over an ellipsoid under integrated squared 
error loss. Pinsker derived the minimax linear esti- 
mator and showed that the minimax risk is equal 
to the linear minimax risk asymptotically. Together 
these results yield the first precise evaluation of the 
asymptotic minimax risk in nonparametric function 
estimation. Donoho, Liu and MacGibbon (1990) considered
certain more general quadratically convex
parameter spaces and showed that the linear min- 
imax risk is within a small constant of the mini- 
max risk. Furthermore, they also showed the limita- 
tions of linear procedures when the parameter space 
is not quadratically convex. Donoho and Johnstone 
(1998) studied minimax estimation over Besov balls 
which include cases that are not quadratically con- 
vex. Besov spaces are a very rich class of function 
spaces that are commonly used to model functions 
of inhomogeneous smoothness in functional analy- 
sis, statistics and signal processing. They also con- 
tain as special cases many traditional smoothness 
spaces such as Hölder and Sobolev spaces. The results
of Donoho and Johnstone marked another major
advance in the minimax estimation theory. In
this setting it is shown that nonlinearity is essential 
for achieving minimaxity or even the minimax rate. 
Moreover, it is shown that the risk of the optimal co- 
ordinatewise thresholding rule is within a constant 
factor of the minimax risk. 

The problem of estimating a function under point- 
wise squared error will be discussed in Section 4. 
This problem can be considered as a special case of 
estimating a linear functional. The minimax theory 
for estimating a linear functional over a convex parameter
space has been well developed in Ibragimov
and Hasminskii (1984), Donoho and Liu (1991) and 
Donoho (1994). In particular, the minimax difficulty 
of estimation is captured by a geometric quantity, 
the modulus of continuity, and the risk of the optimal linear
shrinkage estimator is within a factor of 1.25 of the
minimax risk. Cai and Low (2004a) extended this
minimax theory to nonconvex parameter spaces. In 
this case, although the minimax rate of convergence 
is still determined by the modulus of continuity, op- 
timal linear procedures can be arbitrarily far from 
being minimax and nonlinearity is necessary for min- 
imax estimation. 

The theory of adaptive estimation depends strong- 
ly on how risk is measured. When the performance 
is measured globally, sharp adaptation can often be
achieved. That is, one can attain the minimax risk 
over a collection of parameter spaces simultaneously. 
In particular, Efromovich and Pinsker (1984) con- 
structed sharp adaptive estimators over a range of 
Sobolev spaces. Recent results on rate adaptive es- 
timators focus on the more general Besov spaces. 
See, for example, Donoho and Johnstone (1995), Cai 
(1999), Johnstone and Silverman (2005) and Zhang 
(2005). In particular, Zhang (2005) developed gen- 
eral empirical Bayes methods which are asymptoti- 
cally sharp minimax simultaneously over a wide col- 
lection of Besov balls. Adaptive estimation under 
the global loss will be discussed in Section 3. While 
separable rules are optimal for minimax estimation, 
they cannot be rate adaptive. Information pooling is 
a necessity for achieving adaptivity. Block threshold- 
ing provides a convenient and effective tool for infor- 
mation pooling. We discuss in detail block thresh- 
olding rules via the approach of ideal adaptation 
with an oracle. Through block thresholding, many 
shrinkage estimators developed in the normal deci- 
sion theory can be used for nonparametric function 
estimation. In this sense block thresholding serves as 
a bridge between the classical theory and the mod- 
ern function estimation theory. 

Under pointwise risk it is often the case that sharp 
adaptation is not possible and a penalty, usually 
a logarithmic factor, must be paid for not know- 
ing the smoothness. Important work in this area 
began with Lepski (1990) where attention focused 
on a collection of Lipschitz classes. Brown and Low 
(1996b) obtained similar results using a constrained 
risk inequality, Tsybakov (1998) investigated point- 
wise adaptation over Sobolev classes and Cai (2003) 
considered Besov spaces. Efromovich and Low (1994) 
studied estimation of linear functionals over a nested 



sequence of symmetric sets. A general adaptation 
theory for estimating linear functionals is given in 
Cai and Low (2005a). This theory gives a geometric
characterization of the adaptation problem analogous
to that given by Donoho (1994) for minimax
theory. The adaptation theory describes exactly
when rate adaptive estimators exist; when
they do not exist, the theory provides a general
construction of estimators with the minimum adaptation
cost.

In addition to point estimation, confidence sets 
also play a fundamental role in statistical inference. 
The construction of nonparametric confidence sets 
is an important and challenging problem. In Sec- 
tion 5 we consider nonparametric confidence sets 
with a particular focus on confidence intervals. Other 
confidence sets such as confidence balls and con- 
fidence bands have also been discussed in the lit- 
erature. A minimax theory of confidence intervals 
for linear functionals was given in Donoho (1994) 
when the parameter space is assumed to be con- 
vex. Donoho (1994) constructed fixed length inter- 
vals centered at linear estimators which have length 
within a small constant factor of the minimax ex- 
pected length. Cai and Low (2004b) extended the 
minimax theory for parameter spaces that are finite 
unions of convex sets. In this case it is shown that 
optimal confidence intervals centered at linear esti- 
mators can have expected length much larger than 
the minimax expected length. It is thus essential to 
center the confidence interval at a nonlinear estima- 
tor in order to achieve minimaxity over nonconvex 
parameter spaces. 

An adaptation theory for confidence intervals was 
developed in Cai and Low (2004a). When atten- 
tion is focused on adaptive inference there are some 
striking differences between adaptive estimation and 
adaptive confidence intervals. As mentioned earlier, 
sharp adaptation is often possible under integrated 
squared error and the cost of adaptation is typi- 
cally a logarithmic factor under pointwise squared 
error. In contrast, in many common cases the cost 
of adaptation for confidence intervals is so high that 
adaptation becomes essentially impossible. 

There is also a conspicuous difference between con- 
fidence intervals in parametric and nonparametric 
settings. To construct a confidence interval in para- 
metric inference, a virtually universal technique is 
to first derive an optimal estimator of a parameter 
and then construct a confidence interval centered at 
this optimal estimator. It is often the case that such 
a method leads to an optimal confidence interval for
the parameter. This is also a common practice in
nonparametric function estimation. However, somewhat
surprisingly, centering confidence intervals at
optimally adaptive estimators in general yields suboptimal
confidence procedures (Cai and Low, 2005c):
either the resulting interval has poor coverage probability
or it is unnecessarily long.

The paper is organized as follows. We begin with 
minimax estimation under global integrated squared 
error loss. Section 2 focuses on the important re- 
sults developed in Pinsker (1980), Donoho, Liu and 
MacGibbon (1990) and Donoho and Johnstone (1998)
on linear minimaxity, separable rules and minimax- 
ity. Section 3 considers adaptive estimation under 
the global loss. The performance of separable rules is 
studied in the context of adaptive estimation. The re- 
sults show that separable rules cannot be rate adap- 
tive and information pooling is essential for adap- 
tive estimation. We then discuss block thresholding 
rules using an oracle approach. Section 4 considers 
minimax and adaptive estimation under pointwise 
squared error loss and the construction of minimax 
and adaptive confidence intervals is treated in Sec- 
tion 5. The paper is concluded with discussions in 
Section 6. 

2. LINEAR MINIMAXITY, SEPARABLE 
RULES AND MINIMAXITY 

Minimax theory has been well developed in the 
Gaussian sequence model (and, equivalently, the whi- 
te noise model). Two classes of estimators, namely, 
linear shrinkage rules and separable rules, figure pro- 
minently in the development of the theory. In this 
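section we consider minimax estimation under global
mean integrated squared error (MISE)

$$ R(\hat f, f) = E_f \|\hat f - f\|_2^2 = E_f \int_0^1 (\hat f(t) - f(t))^2\,dt \tag{5} $$

for the function estimation model (3) and

$$ R(\hat\theta, \theta) = E_\theta \|\hat\theta - \theta\|_2^2 $$

for the sequence estimation model (4). Because of
the isometry of the risks $R(\hat f, f) = R(\hat\theta, \theta)$ we shall
focus on the sequence model (4) in this section.
The performance of an estimator $\hat\theta$ over a parameter
set $\mathcal{F}$ is measured by its maximum risk

$$ R_n(\hat\theta, \mathcal{F}) = \sup_{\theta \in \mathcal{F}} E_\theta \|\hat\theta - \theta\|_2^2 $$

and the benchmark is the minimax risk

$$ R_n^*(\mathcal{F}) = \inf_{\hat\theta} \sup_{\theta \in \mathcal{F}} E_\theta \|\hat\theta - \theta\|_2^2. $$

When attention is restricted to linear procedures,
we consider the linear minimax risk

$$ R_n^L(\mathcal{F}) = \inf_{\hat\theta\ \text{linear}} \sup_{\theta \in \mathcal{F}} E_\theta \|\hat\theta - \theta\|_2^2. $$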



In this section we give a concise account of some 
of the most important results in the minimax esti- 
mation theory without getting into too much techni- 
cal detail. We refer interested readers to Iain John- 
stone's monograph (Johnstone, 2002) for a detailed 
discussion on these and other related results. 

2.1 Linear Minimaxity 

Linear estimators and linear minimax risk occupy 
a special place in the development of nonparamet- 
ric function estimation theory. Linear procedures 
are appealing because of their simplicity and lin- 
ear minimax risk is easier to evaluate than the min- 
imax risk. For example, for linear estimation over 
solid and orthosymmetric parameter spaces it suf- 
fices to focus on simple diagonal linear estimators of 
the form $\hat\theta_i = w_i y_i$ where $w_i$ is a constant. Furthermore,
in many settings the optimal linear procedure
is asymptotically minimax or within a small constant
factor of the minimax risk. See, for example, Pinsker
(1980) and Donoho, Liu and MacGibbon (1990). In
this section we shall follow the historical develop- 
ment of the linear minimax theory by discussing the 
theory in the order of ellipsoids, quadratically con- 
vex classes and Besov classes. 

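Linear minimaxity over ellipsoids. Pinsker (1980)
considered minimax estimation over an ellipsoid

$$ \mathcal{F} = \Big\{\theta : \sum_{i=1}^{\infty} a_i^2 \theta_i^2 \le M \Big\}, \tag{6} $$

where $a_i > 0$ and $a_i \to \infty$. Since the ellipsoid $\mathcal{F}$ is
symmetric, the linear minimax risk is attained by
the optimal diagonal linear estimator of the form
$\hat\theta(w) = (w_i y_i)$ where $w = (w_i) \in \ell^2$ with $0 \le w_i \le 1$
is a sequence of weights. That is,

$$ R_n^L(\mathcal{F}) = \inf_{w} \sup_{\theta \in \mathcal{F}} E_\theta \|\hat\theta(w) - \theta\|_2^2. \tag{7} $$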



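The RHS of (7) is easy to evaluate. Note that

$$ E_\theta \|\hat\theta(w) - \theta\|_2^2 = \sum_{i=1}^{\infty} \big( n^{-1} w_i^2 + (1 - w_i)^2 \theta_i^2 \big). $$

Hence, the linear minimax risk

$$ R_n^L(\mathcal{F}) = \inf_{w} \sup_{\theta \in \mathcal{F}} \sum_{i=1}^{\infty} \big( n^{-1} w_i^2 + (1 - w_i)^2 \theta_i^2 \big) = \sup_{\theta \in \mathcal{F}} \sum_{i=1}^{\infty} \frac{n^{-1} \theta_i^2}{n^{-1} + \theta_i^2}. \tag{8} $$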

For any real number $x$, write $(x)_+$ for $\max(x, 0)$. The
Lagrange multiplier method shows that the maximum
on the RHS of (8) is attained at $\theta_i^2 = n^{-1}(\mu/a_i - 1)_+$,
where the parameter $\mu$ is determined by the
constraint $\sum_{i=1}^{\infty} a_i^2 \theta_i^2 = M$, which is equivalent to

$$ n^{-1} \sum_{i=1}^{\infty} a_i (\mu - a_i)_+ = M. $$

The minimax linear estimator is given by $\hat\theta_{\text{l.minimax}} = (\hat\theta_i)$ with

$$ \hat\theta_i = (1 - a_i/\mu)_+\, y_i \tag{9} $$

and the linear minimax risk is

$$ R_n^L(\mathcal{F}) = n^{-1} \sum_{i=1}^{\infty} (1 - a_i/\mu)_+. \tag{10} $$

A remarkable result of Pinsker (1980) is that for ellipsoidal
$\mathcal{F}$ the linear minimax risk is asymptotically
equal to the minimax risk, that is,

$$ R_n^*(\mathcal{F}) = R_n^L(\mathcal{F})(1 + o(1)). $$

Therefore, the minimax linear estimator $\hat\theta_{\text{l.minimax}}$
given in (9) is asymptotically minimax and the minimax
risk is equal to the RHS of (10) asymptotically.
In the case of special interest where the parameter
space is a Sobolev ball

$$ \Theta_2^{\alpha}(M) = \Big\{\theta : \sum_{k=1}^{\infty} (2\pi k)^{2\alpha} (\theta_{2k}^2 + \theta_{2k+1}^2) \le M^2 \Big\} $$

(which corresponds to a Sobolev ball in the function
space under the usual trigonometric basis), the
asymptotic minimax risk and the linear minimax
risk can be evaluated explicitly as

$$ R_n^*(\Theta_2^{\alpha}(M)) = R_n^L(\Theta_2^{\alpha}(M))(1 + o(1)) = \pi^{-2\alpha/(1+2\alpha)} M^{2/(1+2\alpha)} P_\alpha\, n^{-2\alpha/(1+2\alpha)} (1 + o(1)), \tag{11} $$

where

$$ P_\alpha = \Big(\frac{\alpha}{1+\alpha}\Big)^{2\alpha/(1+2\alpha)} (1 + 2\alpha)^{1/(1+2\alpha)} $$

is the Pinsker constant. This is the first exact evaluation
of the asymptotic minimax risk in the nonparametric
function estimation problem. See also Efromovich
and Pinsker (1982) and Nussbaum (1985).

Pinsker's results represent a major contribution to 
nonparametric function estimation theory. Together 
they offer a complete and explicit solution to the 
problem of minimax estimation over ellipsoids. 
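
The Lagrange multiplier recipe behind (9) and (10) is straightforward to carry out numerically. The sketch below is a hedged illustration rather than Pinsker's own construction: it assumes a truncated ellipsoid with the hypothetical choice a_i = i^alpha for i up to 2000, solves the constraint n^{-1} sum_i a_i (mu - a_i)_+ = M for mu by bisection, and then evaluates the weights (1 - a_i/mu)_+ and the linear minimax risk. The function name and all numerical values are assumptions made for the example.

```python
import numpy as np

def pinsker_weights(a, n, budget, tol=1e-12):
    """Solve n^{-1} * sum_i a_i (mu - a_i)_+ = budget for mu by bisection and
    return (mu, weights, linear_minimax_risk) following (9) and (10)."""
    a = np.asarray(a, dtype=float)

    def excess(mu):
        return np.sum(a * np.clip(mu - a, 0.0, None)) / n - budget

    lo, hi = 0.0, 1.0
    while excess(hi) < 0.0:               # bracket the root; excess increases with mu
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if excess(mid) < 0.0 else (lo, mid)
    mu = 0.5 * (lo + hi)
    weights = np.clip(1.0 - a / mu, 0.0, None)     # w_i = (1 - a_i / mu)_+
    risk = np.sum(weights) / n                     # right-hand side of (10)
    return mu, weights, risk

alpha, n, M = 2.0, 1000, 1.0              # hypothetical smoothness, sample size, radius
a = np.arange(1, 2001, dtype=float) ** alpha       # a_i = i^alpha, truncated
mu, w, risk = pinsker_weights(a, n, M)

# Least favourable configuration theta_i^2 = n^{-1} (mu / a_i - 1)_+ and the
# risk of the linear rule with weights w at that configuration.
theta_sq = np.clip(mu / a - 1.0, 0.0, None) / n
risk_at_lf = np.sum(w**2 / n + (1.0 - w) ** 2 * theta_sq)
```

The risk of the weighted estimator at the least favourable configuration coincides with the closed-form value, which is the saddle point property underlying the evaluation of the linear minimax risk.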

Linear minimaxity over quadratically convex classes.
Donoho, Liu and MacGibbon (1990) considered
certain more general quadratically convex parame- 
ter spaces. To discuss their results in more detail, 
we need first to introduce some terminology. 

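A parameter space $\mathcal{F}$ is called solid and orthosymmetric
if $\theta = (\theta_1, \ldots, \theta_k, \ldots) \in \mathcal{F}$ implies that $\xi \in \mathcal{F}$
if $|\xi_i| \le |\theta_i|$ for all $i$. A set $\mathcal{F}$ is called quadratically
convex if the set $\{(\theta_i^2)_{i=1}^{\infty} : \theta \in \mathcal{F}\}$ is convex.
The quadratic convex hull of a set $\mathcal{F}$ is defined as

$$ \text{Q.Hull}(\mathcal{F}) = \{(\theta_i)_{i=1}^{\infty} : (\theta_i^2)_{i=1}^{\infty} \in \text{Hull}(\mathcal{F}_+^2)\}, \tag{12} $$

where $\mathcal{F}_+^2 = \{(\theta_i^2)_{i=1}^{\infty} : (\theta_i)_{i=1}^{\infty} \in \mathcal{F},\ \theta_i \ge 0\ \forall i\}$ and
$\text{Hull}(\mathcal{F}_+^2)$ denotes the closed convex hull of the set $\mathcal{F}_+^2$.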

Donoho, Liu and MacGibbon (1990) showed that
for all solid orthosymmetric, compact and quadratically
convex parameter spaces $\mathcal{F}$ the linear minimax
risk is within a factor of 1.25 of the minimax risk, that
is,

$$ R_n^L(\mathcal{F}) \le 1.25\, R_n^*(\mathcal{F}). \tag{13} $$

Hence, the optimal linear procedure cannot be substantially
improved by a nonlinear estimator. Donoho,
Liu and MacGibbon (1990) proceeded by first solving
an infinite-dimensional hyperrectangle problem
where the parameter space $\mathcal{F}$ is of the form

$$ \mathcal{F} = \{\theta : |\theta_i| \le \tau_i,\ i = 1, 2, \ldots\} \tag{14} $$

with $\sum_i \tau_i^2 < \infty$. The traditional Hölder smoothness
constraint in the function space corresponds to
a hyperrectangle constraint in the sequence space
with a suitably chosen $(\tau_i)$. See, for example, Meyer
(1992). The problem of estimation over a hyperrectangle
is solved by reducing it to coordinatewise one-dimensional
bounded normal mean problems.

Consider estimating a bounded normal mean $\theta \in \mathbb{R}$
based on one observation $y \sim N(\theta, \sigma^2)$ with the
prior knowledge that $|\theta| \le \tau$. It is easy to show that
the minimax linear estimator of the bounded normal
mean $\theta$ is

$$ \delta^L(y) = \frac{\tau^2}{\tau^2 + \sigma^2}\, y $$

and the minimax linear risk is

$$ \rho_L(\tau, \sigma) = \inf_{\delta\ \text{linear}} \sup_{|\theta| \le \tau} E_\theta(\delta(y) - \theta)^2 = \frac{\tau^2 \sigma^2}{\tau^2 + \sigma^2}. $$

Denote the minimax risk for estimating the bounded
normal mean $\theta$ by $\rho^*(\tau, \sigma)$. Let $\mu^*$ be the maximum
value of the ratio of $\rho_L(\tau, \sigma)$ and $\rho^*(\tau, \sigma)$, that is,

$$ \mu^* = \sup_{\tau, \sigma} \frac{\rho_L(\tau, \sigma)}{\rho^*(\tau, \sigma)}. \tag{15} $$

The constant $\mu^*$ is called the Ibragimov-Hasminskii
constant. Ibragimov and Hasminskii (1984) studied
the properties of the ratio $\rho_L(\tau, \sigma)/\rho^*(\tau, \sigma)$ and
showed that the constant $\mu^*$ is finite. Donoho, Liu
and MacGibbon (1990) proved that $\mu^*$ is in fact less
than or equal to 1.25.
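
The minimax linear rule for the bounded normal mean can be checked by brute force. The sketch below is illustrative: it restricts attention to rules of the form c*y (by symmetry of the constraint, an intercept cannot reduce the worst-case risk), uses the fact that the worst case over |theta| <= tau is attained at the endpoints, and compares a grid search over c with the closed form tau^2/(tau^2 + sigma^2). The function names and the values of tau and sigma are assumptions chosen for the example.

```python
import numpy as np

def linear_minimax_bounded_mean(tau, sigma):
    """Minimax linear coefficient c and its worst-case risk when |theta| <= tau."""
    c = tau**2 / (tau**2 + sigma**2)
    risk = tau**2 * sigma**2 / (tau**2 + sigma**2)
    return c, risk

def worst_case_linear_risk(c, tau, sigma):
    """sup over |theta| <= tau of E(c*y - theta)^2 = c^2 sigma^2 + (1 - c)^2 tau^2."""
    return c**2 * sigma**2 + (1.0 - c) ** 2 * tau**2

tau, sigma = 1.5, 1.0                     # hypothetical bound and noise level
c_star, rho_L = linear_minimax_bounded_mean(tau, sigma)

# Brute-force check over a grid of shrinkage coefficients in [0, 1].
grid = np.linspace(0.0, 1.0, 100001)
grid_best = grid[np.argmin(worst_case_linear_risk(grid, tau, sigma))]
```

The grid minimiser agrees with the closed-form coefficient up to the grid resolution, and the resulting risk is smaller than both tau^2 and sigma^2.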

For estimation of $\theta$ over the hyperrectangle $\mathcal{F}$
given in (14) based on the sequence model (4), due to
the independence of the observations $y_i$ and the independent
constraints on $\theta_i$, it is not difficult to see
that the minimax problem is separable. That is, the
minimax (linear) estimator can be obtained through
coordinatewise minimax (linear) estimation. Hence,

$$ R_n^L(\mathcal{F}) = \sum_{i=1}^{\infty} \rho_L(\tau_i, n^{-1/2}), \qquad R_n^*(\mathcal{F}) = \sum_{i=1}^{\infty} \rho^*(\tau_i, n^{-1/2}) $$

and, consequently, for hyperrectangle $\mathcal{F}$,

$$ R_n^L(\mathcal{F}) \le \mu^* R_n^*(\mathcal{F}) \le 1.25\, R_n^*(\mathcal{F}). \tag{16} $$

A key step in solving the more general quadrati- 
cally convex problem is to show that the difficulty 
for the linear estimators over the quadratically con- 
vex parameter space is in fact equal to the difficulty 
for the linear estimators of the hardest rectangular 
subproblem. Then (13) follows directly from (16). 
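
Because the hyperrectangle problem separates across coordinates, its linear minimax risk is a plain sum of one-dimensional risks. The following sketch is a hedged illustration: it assumes the noise standard deviation n^{-1/2} of the sequence model (4) in each coordinate, and the half-widths tau_i = i^{-1.5}, the truncation at 5000 coordinates and the function names are hypothetical choices.

```python
import numpy as np

def rho_linear(tau, sigma):
    """Minimax linear risk tau^2 sigma^2 / (tau^2 + sigma^2) for |theta| <= tau."""
    tau_sq = np.asarray(tau, dtype=float) ** 2
    sigma_sq = sigma**2
    return tau_sq * sigma_sq / (tau_sq + sigma_sq)

def hyperrectangle_linear_risk(tau, n):
    """Sum of coordinatewise linear minimax risks at noise level n^{-1/2}."""
    return float(np.sum(rho_linear(tau, n ** -0.5)))

i = np.arange(1, 5001, dtype=float)
tau = i ** -1.5                            # hypothetical half-widths, square summable
risks = {n: hyperrectangle_linear_risk(tau, n) for n in (100, 1000, 10000)}
```

Each coordinatewise term lies between one half of min(tau_i^2, 1/n) and min(tau_i^2, 1/n), so the sum behaves like a bias-variance trade-off between the coordinates with tau_i^2 above and below the noise level.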

In addition, Donoho, Liu and MacGibbon (1990)
also showed that the linear minimax risk over a solid
compact orthosymmetric set $\mathcal{F}$ is equal to that over
the quadratic convex hull of $\mathcal{F}$,

$$ R_n^L(\mathcal{F}) = R_n^L(\text{Q.Hull}(\mathcal{F})). \tag{17} $$

This result indicates that although the optimal linear
estimator is near minimax over quadratically
convex parameter spaces, linear procedures have serious
limitations when the parameter space $\mathcal{F}$ is not
quadratically convex, especially when the quadratic
convex hull of $\mathcal{F}$ is much larger than $\mathcal{F}$ itself. Such
is the case in wavelet function estimation over certain
Besov balls and in estimation of a sparse normal
mean.



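Linear minimaxity over Besov classes. We now
turn to wavelet estimation over Besov balls. It is
more convenient to use double indices and write the
sequence model (4) as

$$ y_{j,k} = \theta_{j,k} + n^{-1/2} z_{j,k}, \qquad z_{j,k} \overset{\text{i.i.d.}}{\sim} N(0,1), \quad (j,k) \in \mathcal{I}, \tag{18} $$

where the index set $\mathcal{I} = \{(j,k) : k = 1, \ldots, 2^j,\ j = 0, 1, \ldots\}$.
The Besov seminorm $\|\cdot\|_{b_{p,q}^{\alpha}}$ in the sequence
space is then defined as

$$ \|\theta\|_{b_{p,q}^{\alpha}} = \Bigg( \sum_{j=0}^{\infty} \Big( 2^{js} \Big( \sum_{k=1}^{2^j} |\theta_{j,k}|^p \Big)^{1/p} \Big)^q \Bigg)^{1/q}, \tag{19} $$

where $s = \alpha + \frac{1}{2} - \frac{1}{p}$. We shall assume throughout the
paper that $p, q, \alpha, s > 0$. The Besov ball $B_{p,q}^{\alpha}(M)$ is
defined as a ball of radius $M$ under this seminorm,
that is,

$$ B_{p,q}^{\alpha}(M) = \{\theta : \|\theta\|_{b_{p,q}^{\alpha}} \le M\}. \tag{20} $$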



Besov spaces are a very rich class of function spaces 
and occur naturally in many areas of analysis. Besov 
spaces contain as special cases several traditional 
smoothness spaces such as Hölder and Sobolev spaces.
For example, a Hölder space is a Besov space
with $p = q = \infty$ and a Sobolev space is a Besov
space with $p = q = 2$. Full details of Besov spaces
are given, for example, in Triebel (1992) and DeVore 
and Lorentz (1993). See Meyer (1992) and Daube- 
chies (1992) for wavelets and correspondence be- 
tween function spaces and sequence spaces. 
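
Membership in a Besov ball is easy to check once the coefficients are arranged by resolution level. The sketch below is only an illustration and makes an explicit assumption: it uses the standard sequence-space form of the seminorm, namely the l_q norm over levels j of 2^{js} times the l_p norm of the level-j coefficients, with s = alpha + 1/2 - 1/p, truncated to finitely many levels. The function names and the small coefficient array are hypothetical.

```python
import numpy as np

def besov_seminorm(theta_levels, alpha, p, q):
    """Sequence-space Besov seminorm; theta_levels[j] holds the 2^j
    coefficients theta_{j,k} at level j (finitely many levels)."""
    s = alpha + 0.5 - 1.0 / p             # 1/p is 0 when p is infinite
    level_terms = []
    for j, coefs in enumerate(theta_levels):
        coefs = np.abs(np.asarray(coefs, dtype=float))
        lp = coefs.max() if np.isinf(p) else np.sum(coefs**p) ** (1.0 / p)
        level_terms.append(2.0 ** (j * s) * lp)
    level_terms = np.array(level_terms)
    if np.isinf(q):
        return float(level_terms.max())
    return float(np.sum(level_terms**q) ** (1.0 / q))

def in_besov_ball(theta_levels, alpha, p, q, M):
    """True when the seminorm is at most the radius M."""
    return besov_seminorm(theta_levels, alpha, p, q) <= M

theta_levels = [[1.0], [0.5, 0.5]]        # hypothetical coefficients at levels 0 and 1
sobolev_norm = besov_seminorm(theta_levels, alpha=1.0, p=2.0, q=2.0)
holder_norm = besov_seminorm(theta_levels, alpha=1.0, p=np.inf, q=np.inf)
```

The two calls correspond to the Sobolev case p = q = 2 and the Hölder case p = q = infinity mentioned in the text.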

It is easy to verify that for $p \ge 2$ the Besov ball
$B_{p,q}^{\alpha}(M)$ is quadratically convex and when $p < 2$,

$$ \text{Q.Hull}(B_{p,q}^{\alpha}(M)) = B_{2,q}^{s}(M), \tag{21} $$

where again $s = \alpha + \frac{1}{2} - \frac{1}{p}$. Besov spaces with $p < 2$
contain functions of a high degree of spatial inhomogeneity.
See, for example, Triebel (1992), Meyer
(1992) and DeVore and Lorentz (1993). Equations (21)
and (17) together imply that for the Besov
ball $B_{p,q}^{\alpha}(M)$ with $p < 2$,

$$ R_n^L(B_{p,q}^{\alpha}(M)) = R_n^L(\text{Q.Hull}(B_{p,q}^{\alpha}(M))) = R_n^L(B_{2,q}^{s}(M)). \tag{22} $$

In particular, for $p < 2$ the linear minimax risk over
$B_{p,q}^{\alpha}(M)$ converges at the same rate as the minimax
risk over $B_{2,q}^{s}(M)$. As we will see in Section 2.2, the
minimax risk over $B_{p,q}^{\alpha}(M)$ converges at the rate of
$n^{-2\alpha/(1+2\alpha)}$ (Donoho and Johnstone, 1998). Since
$s < \alpha$ for $p < 2$, $n^{-2s/(1+2s)} \gg n^{-2\alpha/(1+2\alpha)}$ and so the
linear minimax risk over a Besov ball $B_{p,q}^{\alpha}(M)$ with
$p < 2$ is substantially larger than the minimax risk.
Therefore, the optimal linear estimator can be significantly
outperformed by a nonlinear procedure.
Intuitively, linear estimators do not perform well
when the underlying functions are spatially inhomogeneous.
In this case it is thus no longer desirable
to restrict attention to the class of linear estimators.

Remark 1. It is interesting to note that a similar
phenomenon also arises in the estimation of a quadratic
functional. Cai and Low (2005b) showed that for
estimating the quadratic functional $Q(\theta) = \sum_{i=1}^{\infty} \theta_i^2$
in the sequence model (4), the minimax quadratic
risk over a solid orthosymmetric parameter space $\mathcal{F}$
equals the minimax quadratic risk over the quadratic
convex hull of $\mathcal{F}$. Consequently, the optimal quadratic
estimator of the quadratic functional $Q(\theta)$ is far
from being minimax over a Besov ball $B_{p,q}^{\alpha}(M)$ with
$p < 2$.

2.2 Separable Rules and Minimaxity 

The shortcoming of linear procedures shows that
nonlinearity is a necessity for achieving minimaxity
over parameter spaces that are not quadratically
convex, such as Besov balls $B_{p,q}^{\alpha}(M)$ with $p < 2$. Separable
rules, which apply nonlinearity to individual
coordinates separately, are a natural generalization
of the linear shrinkage rules. Separable rules play
a fundamental role in minimax estimation over parameter
spaces that are not quadratically convex in
a way similar to the role played by the linear estimators
over the more conventional parameter spaces
such as ellipsoids and hyperrectangles.

Under the sequence model (18), an estimator $\hat\theta = (\hat\theta_{j,k})$
is separable if for all $(j,k) \in \mathcal{I}$, $\hat\theta_{j,k}$ depends
solely on $y_{j,k}$, not on any other $y$'s. We shall denote
by $\mathcal{S}$ the collection of all separable rules. Well-known
examples of separable rules include the traditional
diagonal linear estimators, term-by-term thresholding
estimators and Bayes estimators derived from independent
priors. Separable rules are attractive because
of their simplicity and intuitive appeal. More
importantly, separable rules are minimax for a wide
range of parameter spaces. In an important paper,
Donoho and Johnstone (1998) pioneered the study
of separable rules in minimax estimation over the
Besov ball $B_{p,q}^{\alpha}(M)$ under the sequence model (18).
Zhang (2005) further studied the class of separable
rules in the context of sharp adaptation over the full
scale of Besov balls using general empirical Bayes
methods.

Donoho and Johnstone (1998) began by first solving
the following minimax Bayes estimation problem.
Suppose we observe $y = (y_{j,k})$ as in (18) with
$\theta = (\theta_{j,k})$ itself a random vector satisfying a mean
constraint

$$ \|\tau\|_{b_{p,q}^{\alpha}} \le M, $$

where

$$ \tau_{j,k} = (E|\theta_{j,k}|^{p \wedge q})^{1/(p \wedge q)}, \qquad (j,k) \in \mathcal{I}, $$

with $p \wedge q = \min(p, q)$. In other words, the "hard" constraint
$\theta \in B_{p,q}^{\alpha}(M)$ in the original minimax problem
is replaced by the "in mean" constraint $\tau \in B_{p,q}^{\alpha}(M)$
in the minimax Bayes problem. The minimax Bayes
risk is defined as

$$ R_n^B(B_{p,q}^{\alpha}(M)) = \inf_{\hat\theta} \sup_{\tau \in B_{p,q}^{\alpha}(M)} E\|\hat\theta - \theta\|_2^2. $$

Donoho and Johnstone (1998) showed that the minimax
Bayes risk $R_n^B(B_{p,q}^{\alpha}(M))$ is attained by a separable
rule $\hat\theta^* = (\hat\theta^*_{j,k})$ of the form

$$ \hat\theta^*_{j,k} = \delta_j^*(y_{j,k}), $$

where $\delta_j^*(y_{j,k})$ is a scalar nonlinear function of $y_{j,k}$.
Furthermore, when $\alpha + \frac{1}{2} > 1/(2 \wedge p \wedge q)$, the minimax
Bayes risk is given by

$$ R_n^B(B_{p,q}^{\alpha}(M)) = \gamma(M n^{1/2})\, M^{2/(1+2\alpha)} n^{-2\alpha/(1+2\alpha)} (1 + o(1)), \qquad n \to \infty, \tag{23} $$

where $\gamma(\cdot)$ is a continuous, positive, periodic function
of $\log_2(M n^{1/2})$. Moreover, when $p > q$, the minimax
risk is asymptotically equal to the minimax
Bayes risk,

$$ R_n^*(B_{p,q}^{\alpha}(M)) = R_n^B(B_{p,q}^{\alpha}(M))(1 + o(1)), $$

and thus separable rules are minimax. Zhang (2005)
further showed that the optimal separable rule is
asymptotically minimax for general $(p, q)$. In particular,
these results showed that the minimax rate of
convergence is $n^{-r^*}$ where

$$ r^* = \frac{\alpha}{\alpha + 1/2}. \tag{24} $$

That is,

$$ 0 < \liminf_{n \to \infty} n^{r^*} R_n^*(B_{p,q}^{\alpha}(M)) \le \limsup_{n \to \infty} n^{r^*} R_n^*(B_{p,q}^{\alpha}(M)) < \infty. $$

The linear minimax rate of convergence now follows
immediately from (13), (17), (21) and (24). The linear
minimax risk converges at the rate $n^{-r_l}$ where $r_l$
is given by

$$ r_l = \frac{\alpha + (1/p_- - 1/p)}{\alpha + 1/2 + (1/p_- - 1/p)}, \qquad \text{where } p_- = \max(p, 2). $$

It is clear that $r_l = r^*$ when $p \ge 2$ and $r_l < r^*$ when
$p < 2$. Hence, nonlinear separable rules can outperform
linear estimators at the level of convergence
rates when $p < 2$.
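
The gap between the two rates is easy to tabulate. The short sketch below simply evaluates the minimax exponent alpha/(alpha + 1/2) and the linear exponent with p_- = max(p, 2) as stated above; the function names and the sample values of alpha and p are illustrative assumptions.

```python
def minimax_rate_exponent(alpha):
    """Exponent r* in the minimax rate n^{-r*}."""
    return alpha / (alpha + 0.5)

def linear_rate_exponent(alpha, p):
    """Exponent r_l in the linear minimax rate n^{-r_l}, with p_- = max(p, 2)."""
    p_minus = max(p, 2.0)
    shift = 1.0 / p_minus - 1.0 / p       # zero when p >= 2, negative when p < 2
    return (alpha + shift) / (alpha + 0.5 + shift)

# Hypothetical example: alpha = 1 with p = 1 (spatially inhomogeneous) and p = 4.
r_star = minimax_rate_exponent(1.0)
r_lin_p1 = linear_rate_exponent(1.0, 1.0)
r_lin_p4 = linear_rate_exponent(1.0, 4.0)
```

For alpha = 1 and p = 1 the linear exponent is 1/2 while the minimax exponent is 2/3, so the best linear estimator converges at a strictly slower polynomial rate; for p = 4 the two exponents coincide.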

2.3 Rate-Optimal Coordinatewise 
Thresholding Estimator 

The separable minimax estimator that attains the 
minimax Bayes risk (23) is not available in closed 
form. Donoho and Johnstone (1998) showed that at- 
tention can be further restricted to a simpler coor- 
dinatewise thresholding estimator. It is shown that 
the optimal term-by-term thresholding estimator is 
within a small constant factor of the minimax risk. 
It was noted in Donoho and Johnstone (1998) that
the constant factor is $\Lambda(p \wedge q) \le 1.6$ for $p \wedge q = 1$
using computational experiments and $\Lambda(p \wedge q) \le 2.2$
for $p \wedge q = 1$ for the essentially quadratically convex
(and thus less important) case of $p \ge 2$. However, no
specific rate optimal thresholding estimator is given 
in their paper. 

We now present a rate-optimal coordinatewise
thresholding estimator. Consider the sequence
model (18). Let $J_0$ and $J$ be integers satisfying,
respectively, $M^{2/(1+2\alpha)} n^{1/(1+2\alpha)} \le 2^{J_0} < 2 M^{2/(1+2\alpha)} n^{1/(1+2\alpha)}$
and $n \le 2^J < 2n$. For $j \ge J_0 + 1$,
let

$$ \lambda_j = \sqrt{2 n^{-1} \log(2^{j - J_0})} \tag{25} $$

and let $\eta_\lambda(y) = \text{sgn}(y)(|y| - \lambda)_+$ be the soft threshold
function. We define the following thresholding
estimator:

$$ \hat\theta_{j,k} = \begin{cases} y_{j,k}, & \text{if } 1 \le j \le J_0, \\ \eta_{\lambda_j}(y_{j,k}), & \text{if } J_0 < j \le J, \\ 0, & \text{if } j > J. \end{cases} \tag{26} $$

The estimator given in (26) is similar to the wavelet
estimator given in Delyon and Juditsky (1996) for
density estimation and nonparametric regression over
$B_{p,q}^{\alpha}(M)$ under the Sobolev norm loss. It differs from
the estimator in Delyon and Juditsky (1996) in the
choice of the lower and upper resolution levels $J_0$
and $J$ as well as in the choice of the thresholds $\lambda_j$.
The following theorem can be shown using the same
proof as given in Delyon and Juditsky (1996).

Theorem 1. The risk of the separable estimator $\hat\theta$ given
in (26) is within a constant factor of the minimax
risk over the Besov ball $B_{p,q}^{\alpha}(M)$. That is,

$$ R_n(\hat\theta, B_{p,q}^{\alpha}(M)) \le C(\alpha, p, q)\, R_n^*(B_{p,q}^{\alpha}(M)), $$

where the constant $C(\alpha, p, q)$ depends only on $\alpha$, $p$
and $q$. In particular, the estimator is minimax rate-optimal,

$$ \limsup_{n \to \infty} n^{2\alpha/(1+2\alpha)} \sup_{\theta \in B_{p,q}^{\alpha}(M)} E\|\hat\theta - \theta\|_2^2 < \infty. \tag{27} $$
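
The level-dependent thresholding rule of this section can be sketched in a few lines. The code below is an illustration under stated assumptions, not a reference implementation: it takes J_0 and J to be the smallest integers with 2^{J_0} >= M^{2/(1+2 alpha)} n^{1/(1+2 alpha)} and 2^J >= n, uses the threshold sqrt(2 n^{-1} log 2^{j - J_0}) on the levels J_0 < j <= J, keeps every level j <= J_0 unchanged, and sets levels above J to zero. The dictionary layout of the coefficients, the function names and the numerical values are hypothetical.

```python
import math
import numpy as np

def soft_threshold(y, lam):
    """eta_lambda(y) = sgn(y) (|y| - lambda)_+."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def resolution_levels(n, M, alpha):
    """Smallest integers J0, J with 2^J0 >= M^{2/(1+2a)} n^{1/(1+2a)} and 2^J >= n."""
    target = M ** (2.0 / (1.0 + 2.0 * alpha)) * n ** (1.0 / (1.0 + 2.0 * alpha))
    return math.ceil(math.log2(target)), math.ceil(math.log2(n))

def threshold_estimator(y_levels, n, M, alpha):
    """Level-by-level rule; y_levels maps the level j to the array (y_{j,k})_k."""
    J0, J = resolution_levels(n, M, alpha)
    theta_hat = {}
    for j, y in y_levels.items():
        y = np.asarray(y, dtype=float)
        if j <= J0:
            theta_hat[j] = y.copy()                    # keep coarse levels as observed
        elif j <= J:
            lam = math.sqrt(2.0 / n * math.log(2.0 ** (j - J0)))
            theta_hat[j] = soft_threshold(y, lam)      # shrink intermediate levels
        else:
            theta_hat[j] = np.zeros_like(y)            # discard levels above J
    return theta_hat

n, M, alpha = 1024, 1.0, 1.0               # hypothetical sample size, radius, smoothness
y_levels = {j: np.full(2**j, 0.05) for j in range(12)}   # toy observations
estimate = threshold_estimator(y_levels, n, M, alpha)
```

Note that the rule uses the smoothness alpha and the radius M through J_0, so it is tuned to one particular Besov ball rather than adapting across a collection of them.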

3. ADAPTIVE ESTIMATION THROUGH 
INFORMATION POOLING 

Minimax risk provides a useful uniform bench- 
mark for the comparison of estimators. However, the 
minimax estimators discussed in Section 2 require 
some explicit knowledge of the parameter space which 
is unknown in practice. A minimax estimator de- 
signed for a specific parameter space typically per- 
forms poorly over another parameter space. Recent 
work on nonparametric function estimation has fo- 
cused attention on adaptive estimation, with the 
goal of constructing a single procedure which is near 
minimax simultaneously over a collection of param- 
eter spaces. As mentioned in the Introduction, whe- 
ther this goal can be accomplished depends strongly 
on how risk is measured. When the performance is 
measured by the global MISE risk, sharp adaptation
over Besov balls can be achieved. In fact, a large 
number of adaptive procedures have been developed 
in the literature. In this section we consider adap- 
tive estimation under the MISE risk. For reasons 
of space, we do not give a comprehensive review of 
these adaptive estimators. We shall focus the dis- 
cussion only on block thresholding which naturally 
connects shrinkage rules developed in the classical 
normal decision theory with nonparametric function 
estimation. 

Because of the optimal performance of the sep- 
arable rules in the minimax estimation setting, we 
begin in Section 3.1 by studying the adaptability of 
the separable rules. The results show that separa- 
ble rules have their limitations; they cannot be rate 
adaptive, which implies that information pooling is 
the key to achieving adaptation. We then consider in
Section 3.2 adaptive block thresholding estimators
through ideal adaptation with an oracle.

3.1 Adaptability of Separable Rules

As discussed in Section 2, Zhang (2005) showed
that separable rules are asymptotically minimax over
any given Besov ball $B^\alpha_{p,q}(M)$. Hence, from a minimax
point of view there is little to gain by looking
beyond the separable rules when the parameters
$(\alpha,p,q)$ are fully specified. A natural question
is whether separable rules can achieve the minimax
rate of convergence simultaneously over a collection
of Besov balls. To answer this question, we begin
with a simple version of the adaptation problem by
considering only two Besov balls. Let $B^{\alpha_1}_{p_1,q_1}(M_1)$
and $B^{\alpha_2}_{p_2,q_2}(M_2)$ be two Besov balls with $\alpha_1 \ne \alpha_2$.
We call an estimator $\delta$ rate-adaptive over the two
Besov balls if $\delta$ attains the minimax rate simultaneously
over both of them, that is,

(28)  $\max_{i=1,2} \lim_{n\to\infty} n^{2\alpha_i/(1+2\alpha_i)} \sup_{\theta \in B^{\alpha_i}_{p_i,q_i}(M_i)} E\|\delta - \theta\|_2^2 < \infty.$



The question is: can (28) be achieved by a separable
rule? To answer the question, Cai (2008) showed
that separable rules are "inflexible": any rate-optimal
separable rule over a Besov ball $B^\alpha_{p,q}(M)$ must
have a "flat" rate of convergence everywhere in
$B^\alpha_{p,q}(M)$. If a separable rule $\delta$ satisfies

$\sup_{\theta \in B^\alpha_{p,q}(M)} E\|\delta - \theta\|_2^2 \le C n^{-2\alpha/(1+2\alpha)}$

for some constant $C > 0$, then for any given
$\theta \in B^\alpha_{p,q}(M)$,

(29)  $0 < \liminf_{n\to\infty} n^{2\alpha/(1+2\alpha)} E\|\delta - \theta\|_2^2 \le \limsup_{n\to\infty} n^{2\alpha/(1+2\alpha)} E\|\delta - \theta\|_2^2 < \infty.$



That is, $\delta$ must attain the exact same rate at every
point $\theta \in B^\alpha_{p,q}(M)$. This is not the case for nonseparable
rules. Indeed, there exist estimators that converge
faster than the minimax rate at every point in
$B^\alpha_{p,q}(M)$. See Brown, Low and Zhao (1997), Zhang
(2005) and Cai (2008). As a direct consequence of
the inflexibility of the separable rules, they are
necessarily not rate-adaptive. That is, if $\alpha_1 \ne \alpha_2$, then

(30)  $\inf_{\delta \in \mathcal{S}} \max_{i=1,2} \lim_{n\to\infty} n^{2\alpha_i/(1+2\alpha_i)} \cdot \sup_{\theta \in B^{\alpha_i}_{p_i,q_i}(M_i)} E\|\delta - \theta\|_2^2 = \infty.$



The lack of adaptability of separable rules is close- 
ly connected to superefficiency in the classical uni- 
variate normal mean problem. It is well known that 



if an estimator of a univariate normal mean is su- 
perefficient at a point it must pay for the superef- 
ficiency by being subefficient in a neighborhood of 
that point. The Hodges estimator is an example of 
such estimators. See Le Cam (1953) and Brown and 
Low (1996b). 
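The price paid for superefficiency can be made concrete with the Hodges estimator. The sketch below assumes the textbook form of that estimator (the sample mean of n observations from N(theta, 1), set to zero whenever its absolute value falls below n^(-1/4)) and evaluates its normalized risk n E(delta - theta)^2 in closed form through the normal distribution function. The function names and the choice n = 10,000 are illustrative assumptions.

```python
import math


def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def Phi(u):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def hodges_normalized_risk(theta, n):
    """n * E(delta - theta)^2 for delta = xbar * 1(|xbar| >= n^(-1/4)),
    where xbar ~ N(theta, 1/n). The sample mean itself has normalized
    risk 1 at every theta."""
    mu = math.sqrt(n) * theta   # mean of Z = sqrt(n) * xbar
    c = n ** 0.25               # cutoff on the Z scale
    a, b = -c - mu, c - mu      # the event {|Z| < c} written for U = Z - mu
    # E[U^2 1(a < U < b)] equals Phi(u) - u * phi(u) evaluated from a to b.
    inside = (Phi(b) - b * phi(b)) - (Phi(a) - a * phi(a))
    prob_zeroed = Phi(b) - Phi(a)
    return (1.0 - inside) + mu * mu * prob_zeroed


n = 10_000
risk_at_zero = hodges_normalized_risk(0.0, n)       # superefficient point
risk_nearby = hodges_normalized_risk(n ** -0.25, n)  # a point in the neighborhood
```

At theta = 0 the normalized risk is essentially zero, far below the value 1 of the sample mean, while at theta = n^(-1/4) it is many times larger than 1. This is the trade-off between superefficiency at a point and subefficiency nearby that the paragraph describes.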

Under the sequence model (18), the minimax rate
of convergence over the Besov ball $B^\alpha_{p,q}(M)$ is
$n^{-2\alpha/(1+2\alpha)}$. We call an estimator $\delta$ superefficient
at a fixed point $\theta \in B^\alpha_{p,q}(M)$ if

$n^{2\alpha/(1+2\alpha)} E\|\delta - \theta\|_2^2 \to 0.$



A heuristic proof of (29) sheds light on the cause
of the lack of adaptability for separable rules. Let
$\delta = (\delta_{j,k})$ be a minimax rate-optimal separable rule
over $B^\alpha_{p,q}(M)$. Then individually each $\delta_{j,k}$ can be
regarded as an estimator in a univariate normal mean
problem. If $\delta$ is superefficient at some $\theta^* \in B^\alpha_{p,q}(M)$,
then, as a univariate normal mean problem, many
$\delta_{j,k}$ are superefficient at $\theta^*_{j,k}$ and, thus, each of
these $\delta_{j,k}$ must be penalized in a subefficient
neighborhood of $\theta^*_{j,k}$. There exists some $\theta' \in B^\alpha_{p,q}(M)$ with
coordinates $\theta'_{j,k}$ in those subefficient neighborhoods
of $\theta^*_{j,k}$. As a consequence of $\delta$ being superefficient
at $\theta^*$, $\delta$ is subefficient at $\theta'$ relative to the minimax
risk over $B^\alpha_{p,q}(M)$. This contradicts the assumption
that $\delta$ is rate-optimal uniformly over $B^\alpha_{p,q}(M)$. A rigorous
argument can be found in Cai (2008). The
main reason this phenomenon occurs is that separable
rules estimate each coordinate $\theta_{j,k}$ based solely
on an individual observation $y_{j,k}$. Estimation accuracy
can be improved by pooling information on different
coordinates to make more informative and accurate
decisions.

Equation (30) shows that separable rules need to
pay a price for adaptation. The minimum cost of
adaptation for the separable rules is at least a logarithmic
factor. Suppose $\alpha_1 > \alpha_2$. If a separable rule $\delta$
attains the minimax rate $n^{-2\alpha_1/(1+2\alpha_1)}$ over $B^{\alpha_1}_{p_1,q_1}(M_1)$,
then

(31)  $\lim_{n\to\infty} \left(\frac{n}{\log n}\right)^{2\alpha_2/(1+2\alpha_2)} \sup_{\theta \in B^{\alpha_2}_{p_2,q_2}(M_2)} E\|\delta - \theta\|_2^2 > 0.$



This lower bound bears a strong similarity to the 
problem of adaptive estimation of a function at 
a point. See Section 4. 

The lower bound (31) can indeed be attained by 
a separable rule. The well-known VisuShrink esti- 
mator of Donoho and Johnstone (1994) adaptively 
achieves within a logarithmic factor of the minimax
risk. It is thus optimal among separable rules in the 
sense that it attains the lower bound on the adaptive 
convergence rate within this class of estimators. 

To motivate the VisuShrink estimator, we begin
with the classical multivariate normal mean model (1)
and outline an oracle approach developed in Donoho
and Johnstone (1994). Suppose we wish to estimate
$\theta = (\theta_1, \ldots, \theta_m)$ based on the observations
$x = (x_1, \ldots, x_m)$ in (1) under mean squared error (2).

In the discussion that follows, we focus on the separable
rules. An ideal separable "estimator" $\hat\theta^{\mathrm{ideal}}$
would estimate $\theta_i$ by $x_i$ when $\theta_i^2 > \sigma^2$ and by 0
otherwise, that is, $\hat\theta_i^{\mathrm{ideal}} = x_i I(\theta_i^2 > \sigma^2)$. This "estimator"
achieves ideal trade-off between variance and squared
bias for each coordinate and attains the ideal risk

(32)  $R_{\mathrm{DP.oracle}}(\theta) = \frac{1}{m} \sum_{i=1}^{m} (\theta_i^2 \wedge \sigma^2).$

Since the "estimator" $\hat\theta^{\mathrm{ideal}}$ requires the knowledge
of the unknown $\theta$, it is not a true statistical estimator.
The ideal risk (32) is unattainable in practice,
but it does provide a useful benchmark. To mimic
the performance of the ideal "estimator" $\hat\theta^{\mathrm{ideal}}$,
Donoho and Johnstone (1994) proposed the soft threshold
estimator

(33)  $\hat\theta^*_i = \mathrm{sgn}(x_i)(|x_i| - \tau)_+,$

with $\tau = \sigma\sqrt{2 \log m}$, and showed the following
Oracle Inequality:

(34)  $R(\hat\theta^*, \theta) \le (2\log m + 1)\,[R_{\mathrm{DP.oracle}}(\theta) + \sigma^2/m]$  for all $\theta \in \mathbb{R}^m$.

Hence, the soft threshold estimator $\hat\theta^*$ comes within
a logarithmic factor of the ideal risk for all $\theta \in \mathbb{R}^m$.
Moreover, the factor $2\log m$ in the Oracle Inequality (34)
is asymptotically sharp in the following sense:

(35)  $\inf_{\hat\theta} \sup_{\theta \in \mathbb{R}^m} \frac{E\|\hat\theta - \theta\|_2^2}{\sigma^2 + \sum_{i=1}^{m} (\theta_i^2 \wedge \sigma^2)} = 2\log m\,(1 + o(1)), \quad m \to \infty.$

A similar result to (35) is given in Foster and George
(1994) in the linear regression setting.
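The Oracle Inequality (34) can be checked numerically for any particular mean vector, because the risk of soft thresholding a single normal observation has a closed form in terms of the normal density and distribution function. The sketch below uses that closed form; the sparse test vector (8 coordinates equal to 5 out of m = 1024) and all function names are hypothetical choices made for illustration.

```python
import math


def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def Phi(u):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def soft_risk_unit(lam, mu):
    """Exact risk E(eta_lam(mu + Z) - mu)^2 for Z ~ N(0, 1)."""
    mass = Phi(lam - mu) - Phi(-lam - mu)   # probability of being set to zero
    return (1.0 + lam * lam + (mu * mu - lam * lam - 1.0) * mass
            - (lam - mu) * phi(lam + mu) - (lam + mu) * phi(lam - mu))


def soft_threshold_risk(theta, sigma):
    """Risk (2) of the estimator (33) with tau = sigma * sqrt(2 log m)."""
    m = len(theta)
    lam = math.sqrt(2.0 * math.log(m))      # threshold in units of sigma
    return sum(sigma ** 2 * soft_risk_unit(lam, t / sigma) for t in theta) / m


def dp_oracle_risk(theta, sigma):
    """Ideal risk (32): the average of min(theta_i^2, sigma^2)."""
    return sum(min(t * t, sigma ** 2) for t in theta) / len(theta)


# Hypothetical sparse mean vector.
sigma = 1.0
theta = [5.0] * 8 + [0.0] * 1016
m = len(theta)
risk = soft_threshold_risk(theta, sigma)
bound = (2.0 * math.log(m) + 1.0) * (dp_oracle_risk(theta, sigma) + sigma ** 2 / m)
```

For this vector `risk` falls below `bound`, as (34) guarantees for every mean vector. Most of the risk comes from the 8 large coordinates, where soft thresholding incurs a bias of roughly the threshold.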

In the setting of the Gaussian sequence model (18),
VisuShrink is defined as

(36)  $\hat\theta_{j,k} = \begin{cases} \mathrm{sgn}(y_{j,k})(|y_{j,k}| - \sqrt{2 n^{-1} \log n})_+, & \text{if } j \le J, \\ 0, & \text{if } j > J, \end{cases}$

where $J = \lfloor \log_2 n \rfloor$. The VisuShrink estimator adaptively
achieves the rate of convergence $(\log n / n)^{2\alpha/(1+2\alpha)}$
over the Besov balls $B^\alpha_{p,q}(M)$ (Donoho
et al., 1995). That is,

(37)  $\sup_{\theta \in B^\alpha_{p,q}(M)} E\|\hat\theta - \theta\|_2^2 \le C \left(\frac{\log n}{n}\right)^{2\alpha/(1+2\alpha)},$

where $C > 0$ is a constant not depending on $n$. In
light of the lower bound (31), VisuShrink is thus
optimal within the class of separable rules.

3.2 Block Thresholding via Ideal 
Adaptation with Oracle 

The results in Section 3.1 show that information 
pooling is a necessity for achieving full adaptation. 
Block thresholding, which estimates the coordinates 
in groups rather than individually, provides a conve- 
nient and effective tool for information pooling. Block 
thresholding increases estimation precision and 
achieves adaptivity by utilizing information about 
neighboring coordinates. The degree of adaptivity, 
however, depends on the choice of block size and 
threshold level. 

We study block thresholding rules via the approach 
of ideal adaptation with an oracle. The main ideas 
of the oracle approach have been outlined at the end 
of Section 3.1 in developing the VisuShrink estima- 
tor. An oracle does not reveal the true estimand, 
but provides the ideal choice within a given class of 
estimators. The oracle "estimator" is typically not 
a true statistical estimator, as it may depend on the 
unknown parameter. It represents an ideal for a par- 
ticular estimation method. The goal of ideal adap- 
tation is to derive true statistical estimators which 
can essentially mimic the performance of an oracle. 

The soft threshold estimator (33) estimates coor- 
dinates individually without using information about 
other coordinates. As we have shown in Section 3.1, 
such a separable rule is not optimal for adaptive 
estimation. We thus consider a more general class 
of estimators, the block projection (BP) estimators, 
which use information about neighboring coordina- 
tes by thresholding observations in groups. Simulta- 
neous decisions are made to retain or discard all the 
coordinates within the same group. 

We again begin with the finite-dimensional multivariate
normal mean model (1). We wish to estimate
the mean $\theta = (\theta_1, \ldots, \theta_m)$ based on the observations
$x = (x_1, \ldots, x_m)$ in (1) under the mean
squared error (2). Let $B_1, B_2, \ldots, B_N$ be a partition
of the index set $\{1, \ldots, m\}$ with each $B_j$ of size $L$
(for convenience, we assume that the sample size $m$
is divisible by the block size $L$). Let $\mathcal{H}$ be a subset
of the block indices $\{1, \ldots, N\}$. A block projection
estimator $\hat\theta(\mathcal{H})$ is defined as

(38)  $\hat\theta_{B_j}(\mathcal{H}) = x_{B_j}$ if $j \in \mathcal{H}$ and $\hat\theta_{B_j}(\mathcal{H}) = 0$ if $j \notin \mathcal{H}$,

where $x_{B_j} = (x_i)_{i \in B_j}$. The risk of $\hat\theta(\mathcal{H})$ is

(39)  $R(\hat\theta(\mathcal{H}), \theta) = \frac{1}{m} \sum_{j=1}^{N} \{ L\sigma^2 I(j \in \mathcal{H}) + \|\theta_{B_j}\|_2^2 I(j \notin \mathcal{H}) \}.$

Ideally, one would like to choose $\mathcal{H}$ to consist of
blocks $j$ where $\|\theta_{B_j}\|_2^2 > L\sigma^2$. A BP oracle provides
exactly this side information $\mathcal{H}^* = \mathcal{H}^*(\theta) = \{j : \|\theta_{B_j}\|_2^2 > L\sigma^2\}$,
which yields the ideal block projection
"estimator" $\hat\theta(\mathcal{H}^*)$ with $\hat\theta_{B_j}(\mathcal{H}^*) = x_{B_j} I(j \in \mathcal{H}^*)$
and the ideal risk

(40)  $R_{\mathrm{BP.oracle}}(\theta, L) = \inf_{\mathcal{H}} \frac{1}{m} E\|\hat\theta(\mathcal{H}) - \theta\|_2^2 = \frac{1}{m} \sum_{j=1}^{N} (\|\theta_{B_j}\|_2^2 \wedge L\sigma^2).$

The ideal "estimator" $\hat\theta(\mathcal{H}^*)$ is not a true statistical
estimator. A natural goal is to construct an
estimator which can mimic the performance of the
BP oracle.
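To make the block projection risk (39) and the ideal risk (40) concrete, the following Python sketch computes both for a small mean vector. Blocks are taken to be consecutive runs of length L with 0-based indices, and the example vector is hypothetical; these are conventions chosen for the illustration.

```python
def block_energies(theta, L):
    """Squared l2 norm of each consecutive block of length L."""
    if len(theta) % L != 0:
        raise ValueError("dimension must be divisible by the block size")
    return [sum(t * t for t in theta[s:s + L]) for s in range(0, len(theta), L)]


def block_projection_risk(theta, L, sigma, kept):
    """Risk (39) of the block projection estimator that keeps the blocks
    whose (0-based) indices are in `kept` and zeroes all other blocks."""
    kept = set(kept)
    total = 0.0
    for j, energy in enumerate(block_energies(theta, L)):
        # A kept block costs its variance L * sigma^2, a killed block its energy.
        total += L * sigma ** 2 if j in kept else energy
    return total / len(theta)


def bp_oracle(theta, L, sigma):
    """Oracle block set H* and the ideal risk (40)."""
    energies = block_energies(theta, L)
    kept = [j for j, e in enumerate(energies) if e > L * sigma ** 2]
    risk = sum(min(e, L * sigma ** 2) for e in energies) / len(theta)
    return kept, risk


# Hypothetical mean vector with two strong blocks and two weak ones.
theta = [3.0, 3.0, 0.1, 0.1, 0.0, 0.0, 2.0, 0.0]
kept, ideal = bp_oracle(theta, L=2, sigma=1.0)
```

Here the oracle keeps the first and last blocks. Its risk coincides with `block_projection_risk` evaluated at the oracle set and is no larger than the risk of keeping every block or of discarding every block.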

Since Stein's 1956 seminal paper, many shrinkage
estimators have been developed in the multivariate
normal decision theory. Among them, the
(positive part) James-Stein estimator is perhaps the
best-known. Efron and Morris (1973) showed that
the (positive part) James-Stein estimator does more
than just demonstrate the inadequacy of the maximum
likelihood estimator; it is a member of a class
of good shrinkage rules, all of which may be useful
in different estimation problems. Indeed, as we shall
see below, blockwise James-Stein rules can essentially
mimic the performance of the BP oracle when
the threshold is properly chosen. For each block $B_j$
let $S_j^2 = \sum_{i \in B_j} x_i^2$ and set

(41)  $\hat\theta_{B_j}(L, \lambda) = \left(1 - \frac{\lambda L \sigma^2}{S_j^2}\right)_+ x_{B_j}.$

Then the blockwise James-Stein estimator satisfies
the following BP Oracle Inequality:

(42)  $R(\hat\theta(L,\lambda), \theta) \le \lambda R_{\mathrm{BP.oracle}}(\theta, L) + 4\sigma^2 \cdot P(\chi^2_L > \lambda L),$

where $\chi^2_L$ denotes a central chi-squared random variable
with $L$ degrees of freedom.
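A minimal sketch of the blockwise positive-part James-Stein rule (41) is given below. It assumes consecutive blocks of length L and treats a block with zero energy as fully shrunk; the function name and the sample vector are hypothetical.

```python
def block_james_stein(x, L, lam, sigma):
    """Blockwise positive-part James-Stein rule (41): each block x_B is
    multiplied by (1 - lam * L * sigma^2 / S^2)_+ with S^2 = ||x_B||^2."""
    if len(x) % L != 0:
        raise ValueError("dimension must be divisible by the block size")
    out = []
    for s in range(0, len(x), L):
        block = x[s:s + L]
        energy = sum(v * v for v in block)          # S_j^2
        if energy > 0.0:
            shrink = max(1.0 - lam * L * sigma ** 2 / energy, 0.0)
        else:
            shrink = 0.0                            # convention for an all-zero block
        out.extend(shrink * v for v in block)
    return out


# Hypothetical data: one strong block followed by one weak block.
x = [3.0, 4.0, 0.5, -0.5]
est = block_james_stein(x, L=2, lam=4.50524, sigma=1.0)
```

The first block has energy 25, which exceeds lam * L * sigma^2, so it is kept and shrunk by a common factor. The second block has energy 0.5 and is set to zero as a whole, which is the simultaneous keep-or-kill decision that distinguishes block rules from coordinatewise rules.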

Remark 2. When the block size $L = 1$, the estimator
(41) becomes a coordinatewise thresholding
estimator. It is easy to show that with the choice of
$\lambda = 2\log m$ the BP Oracle Inequality (42) is equivalent
to the Oracle Inequality (34) of Donoho and
Johnstone (1994). The resulting estimator shares
similar properties with the VisuShrink estimator.
See Gao (1998).

Remark 3. Another special choice of block size
is $L = L^* = \log m$. The corresponding threshold is
$\lambda = \lambda^* = 4.50524$ (the solution of $\lambda - \log\lambda - 3 = 0$).
The pair $(L^*, \lambda^*)$ is chosen so that the corresponding
estimator in the Gaussian sequence model is (near)
optimal. See the discussion below. In this case the
BP Oracle Inequality becomes

(43)  $R(\hat\theta(L^*, \lambda^*), \theta) \le \lambda^* R_{\mathrm{BP.oracle}}(\theta, L^*) + \frac{2\sigma^2}{m}.$

Therefore, with block size $L^* = \log m$ and thresholding
constant $\lambda^* = 4.50524$, the estimator comes
essentially within a constant factor of 4.50524 of
the ideal risk. Note that this blockwise James-Stein
estimator is not minimax for a given block (since
$\lambda^* > 2$), but it is close to being minimax and
$\lambda^* = 4.50524$ is needed for the optimal performance in the
infinite-dimensional Gaussian sequence model.
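The constant 4.50524 in Remark 3 can be recovered numerically by solving the stated equation. The sketch below uses plain bisection on an assumed bracket [1, 10], which isolates the root larger than 1. It also evaluates the standard Chernoff bound for a chi-squared tail, P(chi^2_L > lam L) <= (lam e^(1 - lam))^(L/2) for lam > 1. Linking that bound to the choice of the threshold is our reading of the remark that the threshold comes from a chi-squared tail probability, and is not spelled out in the text.

```python
import math


def threshold_equation(lam):
    """Left-hand side of lam - log(lam) - 3 = 0."""
    return lam - math.log(lam) - 3.0


def solve_lambda_star(lo=1.0, hi=10.0, tol=1e-12):
    """Bisection for the root of the threshold equation that exceeds 1."""
    assert threshold_equation(lo) < 0.0 < threshold_equation(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if threshold_equation(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def chernoff_tail_bound(lam, L):
    """Chernoff bound (lam * e^(1 - lam))^(L / 2) on P(chi^2_L > lam * L), lam > 1."""
    return (lam * math.exp(1.0 - lam)) ** (L / 2.0)


lambda_star = solve_lambda_star()
m = 1000
tail = chernoff_tail_bound(lambda_star, math.log(m))   # block size L* = log m
```

With this root the bound equals exp(-L), so for L* = log m it is 1/m up to rounding. That matches the order of the remainder term in (43).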

Remark 4. Instead of the block projection estimators
given in (38), one can also consider the more
general block linear shrinkers: $\hat\theta_{B_j} = \gamma_j x_{B_j}$, $\gamma_j \in [0, 1]$.
In the case of block projection, $\gamma_j \in \{0, 1\}$. An oracle
would provide the ideal shrinkage factors
$\gamma_j = \|\theta_{B_j}\|_2^2 / (\|\theta_{B_j}\|_2^2 + L\sigma^2)$, and the ideal "estimator"
has the risk

$R_{\mathrm{BLS.oracle}}(\theta, L) = \frac{1}{m} \sum_{j=1}^{N} \frac{\|\theta_{B_j}\|_2^2 \, L\sigma^2}{\|\theta_{B_j}\|_2^2 + L\sigma^2}.$

The blockwise James-Stein estimator (41) also
mimics the performance of the block linear shrinker
oracle,

(44)  $R(\hat\theta(L,\lambda), \theta) \le 2\lambda R_{\mathrm{BLS.oracle}}(\theta, L) + 4\sigma^2 \cdot P(\chi^2_L > \lambda L).$
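The passage from the BP Oracle Inequality (42) to the inequality (44) rests on an elementary comparison of the two oracle risks. The short derivation below is a sketch of that step; the shorthand $a$ and $b$ is introduced here for convenience and does not appear in the original.

```latex
% Shorthand (introduced here): a = \|\theta_{B_j}\|_2^2 \ge 0, \quad b = L\sigma^2 > 0.
\[
  \frac{ab}{a+b} \;\le\; a \wedge b \;\le\; \frac{2ab}{a+b},
\]
% since ab = (a \wedge b)\max(a,b) and \max(a,b) \le a + b \le 2\max(a,b).
% Summing over the blocks j = 1, \dots, N and dividing by m gives
\[
  R_{\mathrm{BLS.oracle}}(\theta, L) \;\le\; R_{\mathrm{BP.oracle}}(\theta, L)
  \;\le\; 2\, R_{\mathrm{BLS.oracle}}(\theta, L),
\]
% so the bound \lambda R_{\mathrm{BP.oracle}} in (42) is at most
% 2\lambda R_{\mathrm{BLS.oracle}}, which is the leading term of (44).
```

The same comparison shows that the two oracles differ by at most a factor of two, so mimicking either one is equivalent up to constants.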

We now return to the Gaussian sequence model (18)
and consider the BlockJS procedure introduced in
Cai (1999). Let $J = \lfloor \log_2 n \rfloor$. Divide each resolution
level $1 \le j \le J$ into nonoverlapping blocks of length
$L = L^* = \lfloor \log n \rfloor$. (The coordinates in the first few
resolution levels are grouped into a single block.)
Let $b_i^j$ denote the set of indices of the coordinates in
the $i$th block at level $j$, that is,

$b_i^j = \{(j,k) : (i-1)L + 1 \le k \le iL\}.$

Set $S_{j,i}^2 = \sum_{(j,k) \in b_i^j} y_{j,k}^2$. We then apply the James-Stein
shrinkage rule to each block $b_i^j$:

(45)  $\hat\theta^*_{j,k} = \begin{cases} \left(1 - \frac{\lambda^* L n^{-1}}{S_{j,i}^2}\right)_+ y_{j,k}, & \text{for } (j,k) \in b_i^j,\ j \le J, \\ 0, & \text{for } j > J, \end{cases}$

where $\lambda^* = 4.50524$ is the solution of $\lambda - \log\lambda - 3 = 0$.
This threshold is derived based on the tail probability
of a chi-squared distribution. See Cai (1999).
The BlockJS estimator (45) is adaptively within
a constant factor of the minimax risk over all Besov
balls $B^\alpha_{p,q}(M)$ for $p \ge 2$ and is within a logarithmic
factor of the minimax risk over Besov balls $B^\alpha_{p,q}(M)$
for $p < 2$,

(46)  $\sup_{\theta \in B^\alpha_{p,q}(M)} E\|\hat\theta^* - \theta\|_2^2 \le \begin{cases} C n^{-2\alpha/(1+2\alpha)}, & \text{for } p \ge 2, \\ C n^{-2\alpha/(1+2\alpha)} (\log n)^{(2/p-1)/(1+2\alpha)}, & \text{for } p < 2 \text{ and } \alpha p \ge 1. \end{cases}$

The block size and threshold level play important
roles in the performance of a block thresholding
estimator. The block size $L^* = \log n$ and threshold
$\lambda^* = 4.50524$ are shown in Cai (1999) to be optimal
in the sense that the resulting BlockJS estimator is
both globally and locally adaptive. The extra logarithmic
factor in the case of $p < 2$ is unavoidable for
any block thresholding estimators with fixed block
size and threshold.

Adaptation can be achieved through empirically
selecting the block size and threshold at each resolution
level by minimizing Stein's Unbiased Risk Estimate
(Cai and Zhou, 2009). Let $y_{j\cdot} = (y_{j,1}, \ldots, y_{j,2^j})$.
Since the positive part James-Stein estimator (41) is
weakly differentiable, Stein's formula (Stein, 1981)
for unbiased estimate of risk shows that, with the
noise level normalized to one,

$\mathrm{SURE}(y_{j\cdot}, L, \lambda) = \sum_i \left[ L + \frac{\lambda^2 L^2 - 2\lambda L (L-2)}{S_{j,i}^2} \, I(S_{j,i}^2 > \lambda L) + (S_{j,i}^2 - 2L) \cdot I(S_{j,i}^2 \le \lambda L) \right]$

is an unbiased estimate of the risk at level $j$. Choose
the level-dependent block size $L_j$ and threshold $\lambda_j$
to be the minimizer of SURE:

$(L_j, \lambda_j) = \arg\min_{L, \lambda} \mathrm{SURE}(y_{j\cdot}, L, \lambda).$
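The SURE criterion for the blockwise James-Stein rule can be sketched in a few lines of Python. The version below assumes observations rescaled to unit noise variance and consecutive blocks of length L. It then selects (L, lambda) by a plain grid search over user-supplied candidates; that search is an illustrative simplification and not the search ranges of Cai and Zhou (2009). All names and the sample data are hypothetical.

```python
def sure_block(y, L, lam):
    """Stein's unbiased risk estimate for the blockwise positive-part
    James-Stein rule with block size L and threshold lam, assuming the
    observations in y have unit noise variance."""
    if len(y) % L != 0:
        raise ValueError("length must be divisible by the block size")
    total = 0.0
    for s in range(0, len(y), L):
        energy = sum(v * v for v in y[s:s + L])     # S^2 for this block
        if energy > lam * L:
            # Block is kept and shrunk.
            total += L + (lam ** 2 * L ** 2 - 2.0 * lam * L * (L - 2)) / energy
        else:
            # Block is set to zero: L + (S^2 - 2L) = S^2 - L.
            total += energy - L
    return total


def minimize_sure(y, block_sizes, thresholds):
    """Grid search for the (L, lam) pair minimizing SURE.
    Returns the tuple (SURE value, L, lam)."""
    candidates = [(sure_block(y, L, lam), L, lam)
                  for L in block_sizes if len(y) % L == 0
                  for lam in thresholds]
    return min(candidates)


# Hypothetical level of 8 coefficients with two strong blocks.
y = [4.0, -3.5, 0.2, -0.1, 0.3, 0.0, 5.0, 4.5]
best_value, best_L, best_lam = minimize_sure(y, [1, 2, 4], [0.0, 1.0, 2.0, 4.50524])
```

Two sanity checks follow from the formula. With lam = 0 nothing is shrunk and SURE equals the dimension, which is the risk of the raw observations. With a very large threshold every block is zeroed and SURE equals the total energy of y minus the dimension.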

The resulting estimator, called SureBlock, automatically
adapts to the sparsity of the underlying sequence
$\theta$. In particular, the estimator is sharp adaptive
over all Besov balls $B^\alpha_{2,2}(M)$ and simultaneously
achieves within a factor of 1.25 of the minimax risk
over Besov balls $B^\alpha_{p,q}(M)$ for all $p \ge 2$, $q \ge 2$. At the
same time the SureBlock estimator achieves adaptively
within a constant factor of the minimax risk
over a wide collection of Besov balls $B^\alpha_{p,q}(M)$ in the
"sparse case" $p < 2$. These properties are not shared
simultaneously by other commonly used thresholding
procedures such as VisuShrink (Donoho and
Johnstone, 1994), SureShrink (Donoho and Johnstone,
1995) or BlockJS (Cai, 1999).

3.3 Discussion 

The idea of block thresholding can be traced back 
to Efromovich (1985) in estimation using the trigono- 
metric basis. A similar construction was used in 
Brown, Low and Zhao (1997) to produce superef- 
ficient estimators. In the context of wavelet estima- 
tion, global level-by-level thresholding was discussed 
in Donoho and Johnstone (1995) for regression and 
in Kerkyacharian, Picard and Tribouley (1996) for 
density estimation. Cavalier and Tsybakov (2002),
Cavalier et al. (2003) and Cai, Low and Zhao
(2009) used weakly geometrically growing block size
for sharp adaptation over ellipsoids. However, these block
thresholding methods are not local; they essentially
adaptively mimic the performance of the ideal linear
estimator. Because of the serious limitations of the
linear procedures for estimating spatially inhomogeneous
functions discussed at the end of Section 2.1,
these estimators do not enjoy a high degree of spatial
adaptivity. In particular, these estimators do not
perform well over parameter spaces which are not
quadratically convex such as Besov balls $B^\alpha_{p,q}(M)$
with $p < 2$.

Hall, Kerkyacharian and Picard (1998, 1999) introduced
a local blockwise hard thresholding procedure
for density estimation and nonparametric regression
with a block size of the order $(\log n)^2$ where
$n$ is the sample size. Cai and Silverman (2001)
considered overlapping block thresholding estimators.
Block thresholding is a widely applicable technique.
Cai and Low (2005b, 2006b) used block thresholding
procedures for minimax as well as optimal
adaptive estimation of a quadratic functional and
Cai and Low (2006a) used a block thresholding method
for the construction of adaptive confidence balls.

We have focused the discussion on blockwise Ja- 
mes-Stein procedures because of their simplicity. 
In addition to the James-Stein rule, through block 
thresholding, many other shrinkage rules developed 
in the classical normal decision theory can be ap- 
plied as well. For example, estimators of the forms 

$\hat\theta = [1 - \lambda_1\sigma^2/(\lambda_2 + S^2)]_+\, y$ or $\hat\theta = [1 - c(S^2)]_+\, y$,

where $S^2 = \|y\|_2^2$ and $c(\cdot)$ is a suitably chosen function,
can also be used. Besides block thresholding,
the empirical Bayes method is another natural choice 
for information pooling and for constructing adap- 
tive procedures. See Johnstone and Silverman (2005) 
and Zhang (2005). In particular, Zhang (2005) pre- 
sented a class of general empirical Bayes estimators 
that are adaptively sharp minimax over a large col- 
lection of Besov balls. Other methods such as choos- 
ing a threshold by controlling the false discovery rate 
can also be used. See Abramovich et al. (2006). 

4. MINIMAX AND ADAPTIVE ESTIMATION 
UNDER POINTWISE LOSS 

So far the focus has been on the minimax and 
adaptive estimation under the global MISE risk (5). 
For functions of spatial inhomogeneity, the local 
smoothness of the functions varies significantly from 
point to point and global risk measures such as (5) 
cannot wholly reflect the local performance of an es- 
timator. The most commonly used measure of local 
accuracy is pointwise squared error loss. While the 
minimax theory under the pointwise loss is similar to 
that for the global loss, the adaptation theories for 
the two losses are significantly different. Under the 
local loss it is often the case that sharp adaptation is 
not possible and a penalty, usually a logarithmic fac- 
tor, must be paid for not knowing the smoothness. 
Estimation under the pointwise risk (47) is a special 
case of estimating a linear functional T(f). A gen- 
eral theory for estimating linear functionals has been 
developed in the literature. In this section we shall 
first focus on estimating a function under the point- 
wise risk and present a concise account of both the 
minimax and adaptation results. The related mini- 
max and adaptation theory for estimating a general 
linear functional is discussed in Section 4.1. 

We shall return to the white noise model (3) and
consider estimation under pointwise squared error
risk

(47)  $R(\hat f, f; t_0) = E_f(\hat f(t_0) - f(t_0))^2,$

where $t_0 \in (0, 1)$ is any fixed point. For a given
parameter space $\mathcal{F}$, the difficulty of the estimation
problem is measured by the minimax risk

(48)  $R^*_n(\mathcal{F}; t_0) = \inf_{\hat f} \sup_{f \in \mathcal{F}} E_f(\hat f(t_0) - f(t_0))^2.$

Several methods have been developed to study the
minimax estimation problem. These include modulus
of continuity, metric entropy, information inequality,
renormalization and constrained risk inequality.
See, for example, Farrell (1972), Hasminskii (1979),
Stone (1980), Ibragimov and Hasminskii (1984), Donoho
and Liu (1991), Brown and Low (1991), Low
(1992), Donoho and Low (1992) and Birgé and Massart
(1995). For example, the minimax risk over any
convex parameter space can be characterized, up to
a small constant factor, in terms of the modulus of
continuity. For estimation over the Besov balls, the
minimax rate of convergence of the pointwise risk
is derived in Cai (2003) using a constrained risk
inequality. It is shown that the minimax risk satisfies

(49)  $R^*_n(B^\alpha_{p,q}(M); t_0) \asymp n^{-2\nu/(1+2\nu)},$

where $\nu = \alpha - 1/p$. Unlike the minimax rate of convergence
under the global risk, the local minimax rate
of convergence depends on the parameter $p$ as well.
Minimax rate optimal estimators can be constructed
using wavelet thresholding.

The behavior of the estimators which are minimax
rate optimal under the pointwise risk is quite different
from that of rate optimal estimators under the
global MISE risk. It is shown in Cai (2003) that if an
estimator $\hat f$ attains the minimax rate of convergence
over a Besov ball $B^\alpha_{p,q}(M)$, then it must attain the
same "flat" rate at every $f$ in the parameter space;
superefficiency is not possible for rate optimal estimators.
That is, if

(50)  $\lim_{n\to\infty} n^{2\nu/(1+2\nu)} \sup_{f \in B^\alpha_{p,q}(M)} E_f(\hat f(t_0) - f(t_0))^2 < \infty,$

then the estimator $\hat f$ must also satisfy

(51)  $\lim_{n\to\infty} n^{2\nu/(1+2\nu)} E_f(\hat f(t_0) - f(t_0))^2 > 0$

for any fixed $f \in B^\alpha_{p,q}(M)$. In contrast, under the global
MISE risk, rate-optimal estimators over $B^\alpha_{p,q}(M)$
can achieve a much faster rate at some parameter
points. Indeed, it is possible to have estimators
which converge at a rate faster than the minimax
rate at every fixed function in $B^\alpha_{p,q}(M)$; see Brown,
Low and Zhao (1997), Zhang (2005) and Cai (2008).

Pioneering work on adaptive estimation under the
pointwise risk began with Lepski (1990). This work 
focused on Lipschitz balls and showed that it is 
impossible to achieve complete adaptation for free 
when the smoothness parameter is unknown. One 
must pay a price for adaptation. Lepski (1990) and 
Brown and Low (1996b) showed that the cost of 
adaptation is at least a logarithmic factor even when 
the smoothness parameter is known to be one of 
two values. The case of the Sobolev balls was inves- 
tigated by Tsybakov (1998). Cai (2003) considered 
adaptation over Besov balls. 

The inflexibility of the minimax rate optimal estimators
has a direct consequence for adaptive estimation
over Besov balls under the pointwise loss.
Adaptation for free is only possible if the rates of
convergence over the collection of the Besov balls
are the same, that is, $\nu = \alpha - 1/p$ is a fixed constant
for all Besov balls in the collection. Otherwise,
a penalty must be paid for adaptation, even
over two Besov balls $B^{\alpha_i}_{p_i,q_i}(M_i)$, $i = 1, 2$. Let
$\nu_i = \alpha_i - 1/p_i$ for $i = 1, 2$ and suppose $\nu_1 > \nu_2 > 0$. If an
estimator $\hat f$ attains a rate of $n^{-\rho}$ over $B^{\alpha_1}_{p_1,q_1}(M_1)$ with
$\rho > 2\nu_2/(1 + 2\nu_2)$, in particular, if $\hat f$ is rate-optimal
over $B^{\alpha_1}_{p_1,q_1}(M_1)$, then

$\lim_{n\to\infty} \left(\frac{n}{\log n}\right)^{2\nu_2/(1+2\nu_2)} \cdot \sup_{f \in B^{\alpha_2}_{p_2,q_2}(M_2)} E_f(\hat f(t_0) - f(t_0))^2 > 0.$

Therefore, the minimum cost for adaptation is at
least a logarithmic factor. Furthermore, the rate
$(n/\log n)^{2\nu/(1+2\nu)}$ can be adaptively attained, for example,
by the VisuShrink estimator of Donoho and
Johnstone (1994) and the BlockJS estimator discussed
in Section 3.2. See Cai (2003).
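The rate exponents under the two losses, and the size of the logarithmic penalty, are simple arithmetic. The sketch below computes the global exponent 2 alpha / (1 + 2 alpha), the pointwise exponent 2 nu / (1 + 2 nu) with nu = alpha - 1/p from (49), and the factor by which the adaptive pointwise rate exceeds the minimax rate. The function names and the example values alpha = 2, p = 1, n = 10^6 are illustrative assumptions.

```python
import math


def global_exponent(alpha):
    """Exponent 2*alpha / (1 + 2*alpha) of the minimax rate under global MISE."""
    return 2.0 * alpha / (1.0 + 2.0 * alpha)


def pointwise_exponent(alpha, p):
    """Exponent 2*nu / (1 + 2*nu) from (49), with nu = alpha - 1/p."""
    nu = alpha - 1.0 / p
    if nu <= 0.0:
        raise ValueError("requires alpha > 1/p")
    return 2.0 * nu / (1.0 + 2.0 * nu)


def adaptation_penalty(alpha, p, n):
    """Ratio of the adaptive pointwise rate (log n / n)^r to the minimax
    rate n^(-r), that is (log n)^r with r = 2*nu / (1 + 2*nu)."""
    return math.log(n) ** pointwise_exponent(alpha, p)


r_global = global_exponent(2.0)
r_local = pointwise_exponent(2.0, 1.0)      # nu = 2 - 1/1 = 1
penalty = adaptation_penalty(2.0, 1.0, n=10 ** 6)
```

For alpha = 2 and p = 1 the global exponent is 4/5 while the pointwise exponent is only 2/3, which illustrates the dependence of the local rate on p. The penalty grows with n but only at a logarithmic pace.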

Remark. We have focused on adaptation over 
different parameter spaces under a given loss. There 
is another type of adaptation problem which can 
be termed as loss adaptation: given a fixed param- 
eter space, is it possible to construct an estimator 
that adapts to the loss function in the sense that 
the estimator is optimal both locally and globally? 
This problem was considered in Cai, Low and Zhao 
(2007). It was shown that it is impossible for any 
estimator to simultaneously attain the global min- 
imax rate of convergence and the local minimax 
rate at every point when the global and local minimax
rates are different. The minimum penalty for
a global rate-optimal estimator is a logarithmic factor
in terms of the maximum pointwise risk over
$B^\alpha_{p,q}(M)$. The wavelet thresholding estimator with
coefficients estimated by (26) is optimally loss adaptive
in this sense.

4.1 Discussion on Estimation 
of Linear Functionals 

The problem of estimating a function under the
pointwise risk (47) is a special case of estimating
a linear functional $T(f)$. For a given linear functional
$T$ and a parameter space $\mathcal{F}$ define the linear
minimax risk $R^L_n(\mathcal{F}, T)$ and minimax risk $R^*_n(\mathcal{F}, T)$,
respectively, by

$R^L_n(\mathcal{F}, T) = \inf_{\hat T \ \mathrm{linear}} \sup_{f \in \mathcal{F}} E_f(\hat T - T(f))^2$ and

$R^*_n(\mathcal{F}, T) = \inf_{\hat T} \sup_{f \in \mathcal{F}} E_f(\hat T - T(f))^2.$

The minimax theory for estimating a linear functional
$T$ over a convex parameter space has been well
developed. See, for example, Ibragimov and Hasminskii
(1984), Donoho and Liu (1991) and Donoho
(1994). In particular, the properties of the minimax
linear estimators can be described precisely and the
linear minimax risk $R^L_n(\mathcal{F}, T)$ is within a small constant
factor ($\le 1.25$) of the minimax risk $R^*_n(\mathcal{F}, T)$,
that is,

$R^L_n(\mathcal{F}, T) \le \mu^* R^*_n(\mathcal{F}, T) \le 1.25\, R^*_n(\mathcal{F}, T),$

where $\mu^*$ is the Ibragimov-Hasminskii constant given
in (15). A fundamental quantity which captures the
difficulty of the estimation problem in this setting is
the modulus of continuity

(52)  $\omega(\epsilon, \mathcal{F}) = \sup\{|T(g) - T(f)| : \|g - f\|_2 \le \epsilon,\ f \in \mathcal{F},\ g \in \mathcal{F}\}.$

For example, the linear minimax risk is given by

(53)  $R^L_n(\mathcal{F}, T) = \sup_{\epsilon > 0} \frac{\omega^2(\epsilon, \mathcal{F})}{4 + n\epsilon^2}$

and satisfies

$\frac{1}{8}\,\omega^2(n^{-1/2}, \mathcal{F}) \le R^*_n(\mathcal{F}, T) \le R^L_n(\mathcal{F}, T) \le \omega^2(n^{-1/2}, \mathcal{F}).$

See Ibragimov and Hasminskii (1984) and Donoho
and Liu (1991).

In most common cases when estimating a linear
functional over convex parameter spaces the modulus
is Hölderian,

(54)  $\omega(\epsilon, \mathcal{F}) = C\epsilon^{q(\mathcal{F})}(1 + o(1)).$

In this case the exponent $q(\mathcal{F})$ determines the minimax
rate of convergence. Hence, the rate of convergence
is captured by the geometric quantity $\omega$.
Furthermore, Donoho and Liu (1991) showed that 
the modulus can be used to give a recipe for con- 
structing the minimax linear estimator. A key step 
in this analysis is to show that the difficulty for lin- 
ear estimators over a convex parameter space is in 
fact equal to the difficulty for linear estimators of the 
hardest one-dimensional subproblem. This problem 
is again closely connected to the problem of estimat- 
ing a one-dimensional bounded normal mean dis- 
cussed in Section 2.1. Cai and Low (2004a) extended 
the minimax theory for estimating linear functionals 
to nonconvex parameter spaces. It is shown that in 
this setting while the minimax rate of convergence 
is still determined by the modulus of continuity, the 
linear minimax risk can be arbitrarily far from the 
minimax risk. In fact, even if the parameter space is 
only a union of two convex sets, it is possible that 
the maximum risk of the best linear estimator does 
not even converge even though the minimax risk 
converges quickly. This shows that linear estimators 
have serious limitations when the parameter space 
is not convex. 
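For a modulus that is exactly Hölderian, the supremum in (53) can be evaluated in closed form, which shows directly how the exponent q fixes the rate. The sketch below assumes omega(eps) = C eps^q with 0 < q < 1 and no lower-order term. Setting the derivative of eps^(2q) / (4 + n eps^2) to zero gives eps^2 = 4q / (n (1 - q)) and the value C^2 4^(q-1) q^q (1-q)^(1-q) n^(-q). A crude grid search is included as a numerical cross-check. The function names and the values C = 1, q = 2/3, n = 1000 are hypothetical.

```python
import math


def linear_minimax_risk(C, q, n):
    """Closed-form value of sup_{eps > 0} omega(eps)^2 / (4 + n * eps^2)
    for the exactly Holderian modulus omega(eps) = C * eps**q, 0 < q < 1."""
    if not 0.0 < q < 1.0:
        raise ValueError("requires 0 < q < 1")
    return C ** 2 * 4.0 ** (q - 1.0) * q ** q * (1.0 - q) ** (1.0 - q) * n ** (-q)


def linear_minimax_risk_grid(C, q, n, points=20001):
    """Numerical check of the same supremum on a logarithmic grid of eps."""
    best = 0.0
    for i in range(points):
        eps = 10.0 ** (-6.0 + 8.0 * i / (points - 1))   # eps from 1e-6 to 1e2
        best = max(best, (C * eps ** q) ** 2 / (4.0 + n * eps ** 2))
    return best


C, q, n = 1.0, 2.0 / 3.0, 1000
exact = linear_minimax_risk(C, q, n)
approx = linear_minimax_risk_grid(C, q, n)
omega_sq = (C * n ** (-0.5 * q)) ** 2       # omega^2(n^{-1/2}) = C^2 * n^(-q)
```

The ratio `exact / omega_sq` does not depend on n, so under this assumption the linear minimax risk is a fixed multiple of the squared modulus at n^(-1/2) and decays at the rate n^(-q).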

The adaptation theory for estimating linear func- 
tionals is less well developed. As mentioned ear- 
lier, Lepski (1990) was the first to give examples 
which demonstrated that rate optimal adaptation 
over a collection of Lipschitz classes is not possible 
when estimating the function at a point. Efromovich 
and Low (1994) showed that this phenomenon is true
in general over a collection of nested symmetric sets. 
On the other hand, the goal of rate adaptive estima- 
tion of linear functionals can sometimes be realized. 
When the minimax rates over each parameter space 
are slower than any algebraic rate, Cai and Low 
(2003) have given examples of nested symmetric sets 
where sharp adaptive estimators can be constructed. 
In addition, when the parameter spaces are not sym- 
metric, there are also examples where rate adap- 
tive estimators can be constructed. See Efromovich 
(1997a, 1997b, 2000), Lepski and Levit (1998), Efro- 
movich and Koltchinskii (2001) and Kang and Low 
(2002). 

A general adaptation theory for estimating lin- 
ear functionals is given in Cai and Low (2005a). 



This theory gives a geometric characterization of 
the adaptation problem analogous to that given by 
Donoho (1994) for minimax theory. This theory de- 
scribes exactly when rate adaptive estimators ex- 
ist, and when they do not exist the theory provides 
a general construction of estimators with minimum 
adaptation cost. 

It is shown that two geometric quantities, a between
class modulus of continuity and an ordered
modulus of continuity, play a fundamental role in
the adaptation theory. The between class modulus
of continuity, defined by

(55)  $\omega_+(\epsilon, \mathcal{F}_1, \mathcal{F}_2) = \sup\{|T(g) - T(f)| : \|g - f\|_2 \le \epsilon;\ f \in \mathcal{F}_1,\ g \in \mathcal{F}_2\},$

captures the degree of adaptability over two convex
parameter spaces in the same way that the usual
modulus of continuity used by Donoho and Liu (1991)
and Donoho (1994) captures the minimax difficulty
of estimation over a single convex parameter space.
The ordered modulus of continuity, given by

(56)  $\omega(\epsilon, \mathcal{F}_1, \mathcal{F}_2) = \sup\{T(g) - T(f) : \|g - f\|_2 \le \epsilon;\ f \in \mathcal{F}_1,\ g \in \mathcal{F}_2\},$

is instrumental in the construction of adaptive estimators
with minimum adaptation cost.

The theory shows that there are three main cases 
in terms of the cost of adaptation. In the first case, 
the cost of adaptation is a logarithmic factor of n. 
This is the case for estimating a function at a point 
over Lipschitz balls. In the second case sharp adap- 
tation is possible as in the examples considered in 
Lepski and Levit (1998) and Cai and Low (2003). 
This is also the case when estimating a convex or 
some other shape constrained function at a point. 
More dramatically, in the third case the cost of adap- 
tation is much greater than in the first case. The cost 
of adaptation in this case is a power of n. 

5. MINIMAX AND ADAPTIVE 
CONFIDENCE INTERVALS 

The construction of confidence sets is an important
part of statistical inference. As mentioned in the
introduction, there are several types of nonparametric
confidence sets including confidence intervals,
confidence bands and confidence balls. For example,
Li (1989), Beran and Dümbgen (1998), Genovese and
Wasserman (2005), Cai and Low (2006a) and Robins and
van der Vaart (2006) have constructed confidence balls
with near optimal variable radius which also guarantee
coverage probability. Adaptive confidence bands have
been constructed in the special case of shape
restricted functions. See Hengartner and Stark (1995)
and Dümbgen (1998). See also Genovese and Wasserman
(2008).

In this section we shall focus our discussion on 
pointwise confidence intervals for a function. Simi- 
lar to estimation under the pointwise risk, this prob- 
lem is a special case of confidence intervals for lin- 
ear functionals. Both minimax theory and adapta- 
tion theory for confidence intervals of linear func- 
tionals have been developed. In this section we shall 
first discuss the general theory and then use confi- 
dence intervals for a function at a point as examples. 
Again, we will mainly use the Besov balls $B^\alpha_{p,q}(M)$
as the examples. The usual cases of Hölder balls
and Sobolev balls follow by taking $p = q = \infty$ and
$p = q = 2$, respectively.

For any confidence interval there are two inter- 
related issues which need to be considered together, 
coverage probability and the expected length. A min- 
imax theory for confidence intervals of linear func- 
tionals was given in Donoho (1994) for convex pa- 
rameter spaces. In this setting the goal is to con- 
struct confidence intervals with a prespecified cover- 
age probability which minimizes the expected length 
of the interval. Write $\mathcal{I}_{\gamma,\mathcal{F}}$ for the collection of all
confidence intervals which cover the linear functional
$T(f)$ with minimum coverage probability of $1 - \gamma$
over the parameter space $\mathcal{F}$. Denote by

$L(CI, \mathcal{F}) = \sup_{f \in \mathcal{F}} E_f(L(CI))$

the maximum expected length of a confidence interval
$CI$ over $\mathcal{F}$, where $L(CI)$ is the length of $CI$.
The benchmark is the minimax expected length of
confidence intervals in $\mathcal{I}_{\gamma,\mathcal{F}}$,

(57)  $L^*_\gamma(\mathcal{F}) = \inf_{CI \in \mathcal{I}_{\gamma,\mathcal{F}}} \sup_{f \in \mathcal{F}} E_f(L(CI)).$

For convex $\mathcal{F}$, Donoho (1994) showed that the
modulus of continuity defined in (52) determines the
minimax expected length,



(58) 



2w(2z 7 n" 1/2 ,J") 



<L: i {F)<2oj{2z l/2 n' l ^ 1 F), 



1/2 



where $z_\gamma$ is the $100(1 - \gamma)$th percentile of the standard
normal distribution. Moreover, Donoho (1994)
constructed fixed length intervals centered at linear
estimators which have maximum length within
a small constant factor of the minimax expected
length $L^*_\gamma(\mathcal{F})$. Hence, from a minimax point of view
there is relatively little to gain by centering the 
intervals on nonlinear estimators or using variable 
length intervals. 

When the linear functional $T$ is a point evaluation
at $t_0 \in (0, 1)$, that is, $T(f) = f(t_0)$, and the parameter
space is the Besov ball $B^\alpha_{p,q}(M)$, the modulus
satisfies, with $\nu = \alpha - 1/p$,

$\omega(n^{-1/2}, B^\alpha_{p,q}(M)) = C n^{-\nu/(1+2\nu)}(1 + o(1)).$

Following the recipe given in Donoho (1994), one
can construct a fixed length $1 - \gamma$ level interval centered
at a linear estimator with the length of order
$n^{-\nu/(1+2\nu)}$.
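
As a rough numerical illustration of this rate, the Python sketch below evaluates $\nu = \alpha - 1/p$ and the order $n^{-\nu/(1+2\nu)}$ of the fixed length interval. The function names and the example values of $\alpha$, $p$ and $n$ are illustrative assumptions, and the constant $C$ and the $o(1)$ term are ignored.

```python
def besov_nu(alpha, p):
    """nu = alpha - 1/p; p = float('inf') corresponds to the Holder case."""
    return alpha - 1.0 / p


def length_order(n, nu):
    """n^{-nu/(1+2nu)}: order of the fixed length interval (constant dropped)."""
    return n ** (-nu / (1.0 + 2.0 * nu))


if __name__ == "__main__":
    # (alpha, p) pairs chosen only for illustration.
    for alpha, p in [(2.0, float("inf")), (2.0, 2.0), (1.0, 2.0)]:
        nu = besov_nu(alpha, p)
        print(alpha, p, nu, length_order(10_000, nu))
```

Smaller $\nu$ gives a slower rate, so at the same sample size the interval over a rougher class is longer.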

The situation changes significantly when the pa- 
rameter space is not convex. Cai and Low (2004a) 
developed a minimax theory for parameter spaces 
that are finite unions of convex parameter spaces. 
It is shown that in this case the optimal (variable 
length) confidence interval centered at linear esti- 
mators can have expected length much longer than 
the minimax expected length; it is thus essential to 
center the interval at nonlinear estimators in order 
to achieve optimality. 

When attention is focused on adaptive inference 
there are some striking differences between adaptive 
confidence intervals and adaptive estimation. As we 
discussed in the earlier sections, adaptation for free 
is often possible under integrated squared error loss 
and the cost of adaptation is typically a logarith- 
mic factor under pointwise squared error loss. For 
confidence intervals the cost of adaptation can be 
substantially more than that for estimation. In fact, 
in some common cases, the cost of adaptation is so 
high that adaptation becomes basically impossible. 
In these cases the maximum expected length of the 
confidence interval over any parameter space in the 
collection needs essentially to be equal to the max- 
imum expected length over the whole collection in 
order for the confidence interval to have the desired 
coverage probability. See Low (1997). 

An adaptation theory for confidence intervals was 
developed in Cai and Low (2004b). In light of the 
discussion on adaptive estimation given in Section 3, 
a natural goal for adaptive confidence intervals over 
a collection of parameter spaces $\{\mathcal{F}_i, i \in I\}$ is to have
a given coverage probability $1 - \gamma$ over the union
of the parameter spaces $\mathcal{F} = \bigcup_{i \in I} \mathcal{F}_i$ and have the
maximum expected length over each space within
a constant factor of the corresponding minimax
expected length, that is,

(59)  $L(CI, \mathcal{F}_i) \le C_i L^*_\gamma(\mathcal{F}_i),$



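where $C_i$ are constants. Unfortunately, in many common
cases such adaptive confidence intervals do not
exist even for two parameter spaces. Let $\{\mathcal{F}_1, \mathcal{F}_2\}$ be
a pair of convex parameter spaces with nonempty
intersection. Let $\mathcal{F} = \mathcal{F}_1 \cup \mathcal{F}_2$ and $0 < \gamma < 1/2$. It is
shown in Cai and Low (2004b) that for $i = 1, 2$

(60)  $\inf_{CI \in \mathcal{I}_{\gamma,\mathcal{F}}} L(CI, \mathcal{F}_i) \ge \left(\tfrac{1}{2} - \gamma\right) \omega_+(z_\gamma n^{-1/2}, \mathcal{F}_i, \mathcal{F}),$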



where the between class modulus $\omega_+$ is defined
in (55). The lower bound (60) can in fact be attained
within a constant factor not depending on $n$.
A general recipe, which relies on the ordered modulus
$\omega(\varepsilon, \mathcal{F}_i, \mathcal{F}_j)$ defined in (56), is given in Cai and
Low (2004b) for the construction of confidence intervals
which attain the lower bound within a constant
factor.

The lower bound (60), however, can be dramati- 
cally larger than the minimax expected length if the 
parameter space is prespecified. Such is the case for 
pointwise confidence intervals over Besov balls. Consider
constructing a confidence interval for a function
at a point $t_0 \in (0,1)$ over two Besov balls based
on the white noise model. In this case the linear
functional is $T(f) = f(t_0)$. Let $\mathcal{F}_i = B^{\alpha_i}_{p_i,q_i}(M_i)$ with
$\nu_i = \alpha_i - 1/p_i$ for $i = 1, 2$, $\mathcal{F} = \mathcal{F}_1 \cup \mathcal{F}_2$ and suppose
$\nu_1 > \nu_2 > 0$. Then standard calculations, as in, for
example, Donoho and Liu (1987), show

$\omega_+(\varepsilon, \mathcal{F}_i, \mathcal{F}) = \omega(\varepsilon, \mathcal{F}) = C \varepsilon^{2\nu_2/(1+2\nu_2)}(1 + o(1)), \quad i = 1, 2.$



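Thus, any $1 - \gamma$ level confidence interval over both
$B^{\alpha_1}_{p_1,q_1}(M_1)$ and $B^{\alpha_2}_{p_2,q_2}(M_2)$ must have maximum
expected length over $B^{\alpha_1}_{p_1,q_1}(M_1)$ satisfying

(61)  $L(CI, B^{\alpha_1}_{p_1,q_1}(M_1)) \ge \left(\tfrac{1}{2} - \gamma\right) \omega_+(z_\gamma n^{-1/2}, B^{\alpha_1}_{p_1,q_1}(M_1), \mathcal{F}) \asymp \omega(z_\gamma n^{-1/2}, \mathcal{F}).$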



In contrast, if it is known that $f \in B^{\alpha_1}_{p_1,q_1}(M_1)$, $1 - \gamma$
level confidence intervals can be constructed which
satisfy

$L(CI, B^{\alpha_1}_{p_1,q_1}(M_1)) \le C n^{-\nu_1/(1+2\nu_1)} \ll C n^{-\nu_2/(1+2\nu_2)}.$

From (61), the rate of convergence of the maximum
expected length of $CI$ over $B^{\alpha_1}_{p_1,q_1}(M_1)$ is the same
as that for the maximum expected length over $\mathcal{F}$.
From this point of view the cost of adaptation is so
high that adaptation is impossible.
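
The size of this cost can be made tangible with a small Python sketch. It compares the order of the length forced on an interval that is valid over both balls with the order available when the smoother ball is known, ignoring all constants. The function names and the example smoothness values are assumptions made for illustration only.

```python
def rate_exponent(nu):
    """Exponent nu/(1+2nu) in the rate n^{-nu/(1+2nu)}."""
    return nu / (1.0 + 2.0 * nu)


def adaptation_inflation(n, nu1, nu2):
    """Ratio n^{-nu2/(1+2nu2)} / n^{-nu1/(1+2nu1)} for nu1 > nu2 > 0.

    This is how much longer (in order, constants dropped) an interval valid
    over both balls must be on the smoother ball than one built for it alone.
    """
    if not nu1 > nu2 > 0:
        raise ValueError("requires nu1 > nu2 > 0")
    return n ** (rate_exponent(nu1) - rate_exponent(nu2))


if __name__ == "__main__":
    for n in (10**2, 10**4, 10**6):
        print(n, adaptation_inflation(n, nu1=2.0, nu2=0.5))
```

The ratio is a positive power of $n$, so it grows without bound as the sample size increases.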

It is also interesting to note an important differ- 
ence between parametric confidence intervals and 
nonparametric intervals. In the parametric setting, 
a universal practice for the construction of a confi- 
dence interval is to first obtain an optimal estimator 
of a parameter and then construct a confidence in- 
terval for the parameter centered at this estimator. 
Such a method often leads to an optimal confidence 
interval for the parameter. That is, the confidence 
interval has a desired coverage probability and the 
length of the interval is the shortest. In nonparamet- 
ric function estimation, it is also a common practice 
to center confidence intervals on optimally adaptive 
estimators. However, somewhat surprisingly, this in 
general leads to suboptimal confidence procedures 
(Cai and Low, 2005c). That is, either the confidence 
interval has poor coverage probability or it is unnec- 
essarily long. It is instructive to consider an exam- 
ple. 

Let us return to the problem of constructing a confidence
interval for $f(t_0)$ over the two Besov balls
$B^{\alpha_i}_{p_i,q_i}(M_i)$, $i = 1, 2$. Again let $\nu_i = \alpha_i - 1/p_i$ for
$i = 1, 2$ and suppose $\nu_1 > \nu_2 > 0$. Equation (61) shows
that any confidence interval with coverage probability
of at least $1 - \gamma$ over $B^{\alpha_2}_{p_2,q_2}(M_2)$ must have
maximum expected length of the order $n^{-\nu_2/(1+2\nu_2)}$
over both $B^{\alpha_1}_{p_1,q_1}(M_1)$ and $B^{\alpha_2}_{p_2,q_2}(M_2)$. This bound
can easily be attained by using an optimal fixed
length confidence interval. Now suppose $\hat{f}(t_0)$ is an
adaptive estimator under the mean squared error.
Then, in particular, $\hat{f}(t_0)$ has the maximum risk
over $B^{\alpha_1}_{p_1,q_1}(M_1)$ converging at a rate $n^{-r}$ where
$r > \frac{2\nu_2}{1+2\nu_2}$. It follows from the results in Cai and Low
(2005c) that any confidence interval $CI$ centered
at $\hat{f}(t_0)$ with coverage probability of at least $1 - \gamma$
over $B^{\alpha_2}_{p_2,q_2}(M_2)$ must satisfy, for some constant $C > 0$,

(62)  $L(CI, B^{\alpha_2}_{p_2,q_2}(M_2)) \ge C \left(\frac{\log n}{n}\right)^{\nu_2/(1+2\nu_2)} \gg n^{-\nu_2/(1+2\nu_2)}.$

Hence, confidence intervals centered at a mean squared
error rate adaptive estimator must have a longer
maximum expected length over $B^{\alpha_2}_{p_2,q_2}(M_2)$.
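
The gap between the two orders appearing in (62) is a power of $\log n$, which the following Python sketch evaluates. It is a minimal illustration under stated assumptions: constants are dropped, the exponent is taken to be $\nu_2/(1+2\nu_2)$, and the function names and example values are hypothetical.

```python
import math


def centered_length_order(n, nu2):
    """(log n / n)^{nu2/(1+2nu2)}: order of the lower bound for intervals
    centered at an MSE rate adaptive estimator (constant dropped)."""
    a = nu2 / (1.0 + 2.0 * nu2)
    return (math.log(n) / n) ** a


def fixed_length_order(n, nu2):
    """n^{-nu2/(1+2nu2)}: order attained by an optimal fixed length interval."""
    a = nu2 / (1.0 + 2.0 * nu2)
    return n ** (-a)


if __name__ == "__main__":
    for n in (10**2, 10**4, 10**6):
        ratio = centered_length_order(n, 0.5) / fixed_length_order(n, 0.5)
        print(n, ratio)  # equals (log n)^{nu2/(1+2nu2)} up to rounding
```

The ratio grows slowly but without bound, so centering at the adaptive estimator costs a logarithmic factor relative to the fixed length interval.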
An interesting question is when adaptive confidence
intervals exist. It can be seen easily by comparing
the lower bound (60) with the bounds (58)
for the minimax expected length that adaptive confidence
intervals exist if and only if the moduli satisfy

$\omega_+(\varepsilon, \mathcal{F}_i, \mathcal{F}) \asymp \omega(\varepsilon, \mathcal{F}_i), \quad i = 1, 2,$

or, equivalently, $\omega(\varepsilon, \mathcal{F}_2) \le C_1 \omega(\varepsilon, \mathcal{F}_1) \le C_2 \omega_+(\varepsilon, \mathcal{F}_1, \mathcal{F}_2)$.
In this case the adaptive confidence intervals have
maximum expected length which attains the same
optimal rate of convergence as the minimax confidence
interval over a known $\mathcal{F}_i$. This is the case for certain
shape restricted function spaces.

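Consider constructing pointwise confidence intervals
for monotonically decreasing Lipschitz functions.
Again, in this case let $T(f) = f(t_0)$ with $0 < t_0 < 1$.
Let $\mathcal{D}$ be the set of all decreasing functions on the
unit interval and for $0 < \beta \le 1$ let

$\mathrm{Lip}^\beta(M) = \{f : [0,1] \to \mathbb{R},\ |f(x) - f(y)| \le M|x - y|^\beta\}.$

Let $\mathcal{D}^\beta(M) = \mathcal{D} \cap \mathrm{Lip}^\beta(M)$ be the collection of
monotonically decreasing Lipschitz functions. Note that
for $0 < \beta_2 < \beta_1 \le 1$, $\mathcal{D}^{\beta_1}(M) \subset \mathcal{D}^{\beta_2}(M)$. Let
$\mathcal{F} = \bigcup_{0 < \beta \le 1} \mathcal{D}^\beta(M)$. Then standard calculations yield

(64)  $\omega_+(\varepsilon, \mathcal{D}^\beta(M), \mathcal{F}) = \omega(\varepsilon, \mathcal{D}^\beta(M)) = (2\beta + 1)^{1/(2\beta+1)} M^{1/(2\beta+1)} \varepsilon^{2\beta/(2\beta+1)}.$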

The adaptive confidence interval $CI^*$ given in equation
(34) of Cai and Low (2004b) has coverage probability
of at least $1 - \gamma$ over $\mathcal{F}$ and satisfies for any
$0 < \beta \le 1$

(65)  $L(CI^*, \mathcal{D}^\beta(M)) \le 12 (2\beta + 1)^{1/(2\beta+1)} M^{1/(2\beta+1)} z_{\gamma/2}^{2\beta/(2\beta+1)} n^{-\beta/(2\beta+1)} (1 + o(1)).$

Hence, the adaptive confidence interval $CI^*$ simultaneously
comes within a constant factor of the minimax
expected length over all $\mathcal{D}^\beta(M)$ with $0 < \beta \le 1$.
Adaptive confidence intervals also exist for convex
functions. See Cai and Low (2007).
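
For a sense of scale, the Python sketch below evaluates the modulus in (64) and the leading term of the bound (65). It rests on explicit assumptions: the formulas are read as $\omega(\varepsilon) = (2\beta+1)^{1/(2\beta+1)} M^{1/(2\beta+1)} \varepsilon^{2\beta/(2\beta+1)}$ and as $12\,\omega(z_{\gamma/2} n^{-1/2})$ respectively, the $(1 + o(1))$ factor is dropped, and the function names and example values are hypothetical. It does not construct the interval $CI^*$ itself.

```python
from statistics import NormalDist


def modulus_monotone_lipschitz(eps, beta, M):
    """(2b+1)^{1/(2b+1)} M^{1/(2b+1)} eps^{2b/(2b+1)}, the modulus as read from (64)."""
    e = 1.0 / (2.0 * beta + 1.0)
    return (2.0 * beta + 1.0) ** e * M ** e * eps ** (2.0 * beta * e)


def adaptive_length_bound(n, beta, M, gamma):
    """Leading term of the bound (65), dropping the (1 + o(1)) factor."""
    z = NormalDist().inv_cdf(1.0 - gamma / 2.0)  # z_{gamma/2}
    # 12 * modulus at eps = z * n^{-1/2} expands to the product form in (65).
    return 12.0 * modulus_monotone_lipschitz(z / n ** 0.5, beta, M)


if __name__ == "__main__":
    for beta in (0.25, 0.5, 1.0):
        print(beta, adaptive_length_bound(n=10_000, beta=beta, M=1.0, gamma=0.05))
```

The same value of `n` yields a shorter bound for larger $\beta$, reflecting that the interval length adapts to the smoothness within the monotone class.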

6. CONCLUDING REMARKS 

From linear estimators in Pinsker's solution to the 
ellipsoid problem to separable rules in Donoho and 
Johnstone's approach to minimax estimation over
Besov balls to thresholding estimators such as blockwise
James-Stein in adaptive wavelet estimation,
shrinkage plays a pivotal role in both the minimax 
theory and the adaptation theory in nonparametric 
function estimation. In particular, block threshold- 
ing can be viewed as a bridge between the classi- 
cal normal decision theory and nonparametric func- 
tion estimation. Through block thresholding, many 
shrinkage estimators developed in the classical the- 
ory can be used for function estimation. 

The three problems discussed in the paper are 
strongly connected. The minimax difficulty of es- 
timation can be characterized by the modulus of 
continuity and the cost of adaptation is captured 
by the between class modulus. The linear minimax- 
ity and minimaxity in these three problems are all 
linked to the one-dimensional bounded normal mean 
problem. In all three problems the performance of 
linear procedures is closely linked to the (quadratic) 
convexity of the parameter space. Linear shrinkage 
rules are near optimal when the parameter space is 
convex (quadratically convex in the case of global es- 
timation), and linear procedures can be arbitrarily 
far from being minimax when the parameter space 
is not convex. 

Although the minimax theories for the three prob- 
lems are similar, the adaptation theories are remark- 
ably different. Among the three problems, the adap- 
tation results are most positive for estimation under 
the global MISE risk. In this case adaptation for free 
can be achieved. On the other hand, the results for 
adaptive confidence intervals are very pessimistic in 
general. The cost of adaptation is so high that adap- 
tation over commonly used smoothness spaces is vir- 
tually impossible, although adaptation for free can 
be achieved over shape restricted spaces. These re- 
sults indicate that, while the traditional smoothness 
constraint works well for estimation, it may not be 
a practical or correct formulation for the construc- 
tion of adaptive nonparametric confidence intervals 
or bands. Alternative formulations are needed. Gen- 
ovese and Wasserman (2008) is one step in this di- 
rection. 

In this paper we have chosen to focus the discus- 
sion on the canonical white noise with drift model 
to avoid some of the nonessential technical complica- 
tions. Parallel results hold for nonparametric regres- 
sion and density estimation. We should emphasize 
that the discussion as well as the references given 
in this paper are by no means extensive. Interested 
readers are referred to Johnstone (2002) for further 
discussion and for a large number of additional references
on estimation under global integrated squared
error loss.